llvm/docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 .. toctree::
   9    :hidden:
  10
  11    AMDGPU/AMDGPUAsmGFX7
  12    AMDGPU/AMDGPUAsmGFX8
  13    AMDGPU/AMDGPUAsmGFX9
  14    AMDGPU/AMDGPUAsmGFX900
  15    AMDGPU/AMDGPUAsmGFX904
  16    AMDGPU/AMDGPUAsmGFX906
  17    AMDGPU/AMDGPUAsmGFX908
  18    AMDGPU/AMDGPUAsmGFX90a
  19    AMDGPU/AMDGPUAsmGFX940
  20    AMDGPU/AMDGPUAsmGFX10
  21    AMDGPU/AMDGPUAsmGFX1011
  22    AMDGPU/AMDGPUAsmGFX1013
  23    AMDGPU/AMDGPUAsmGFX1030
  24    AMDGPU/AMDGPUAsmGFX11
  25    AMDGPUModifierSyntax
  26    AMDGPUOperandSyntax
  27    AMDGPUInstructionSyntax
  28    AMDGPUInstructionNotation
  29    AMDGPUDwarfExtensionsForHeterogeneousDebugging
  30    AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
  31
  32 Introduction
  33 ============
  34
  35 The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
  36 family through the current GCN families. It lives in the
  37 ``llvm/lib/Target/AMDGPU`` directory.
  38
  39 LLVM
  40 ====
  41
  42 .. _amdgpu-target-triples:
  43
  44 Target Triples
  45 --------------
  46
  47 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
  48 to specify the target triple:
  49
  50   .. table:: AMDGPU Architectures
  51      :name: amdgpu-architecture-table
  52
  53      ============ ==============================================================
  54      Architecture Description
  55      ============ ==============================================================
  56      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  57      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
  58      ============ ==============================================================
  59
  60   .. table:: AMDGPU Vendors
  61      :name: amdgpu-vendor-table
  62
  63      ============ ==============================================================
  64      Vendor       Description
  65      ============ ==============================================================
  66      ``amd``      Can be used for all AMD GPU usage.
  67      ``mesa3d``   Can be used if the OS is ``mesa3d``.
  68      ============ ==============================================================
  69
  70   .. table:: AMDGPU Operating Systems
  71      :name: amdgpu-os
  72
  73      ============== ============================================================
  74      OS             Description
  75      ============== ============================================================
  76      *<empty>*      Defaults to the *unknown* OS.
  77      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
  78                     such as:
  79
  80                     - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
  81                       loader on Linux. See *AMD ROCm Platform Release Notes*
  82                       [AMD-ROCm-Release-Notes]_ for supported hardware and
  83                       software.
  84                     - AMD's PAL runtime using the *pal-amdhsa* loader on
  85                       Windows.
  86
  87      ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
  88                     runtime using the *pal-amdpal* loader on Windows and Linux
  89                     Pro.
  90      ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
  91                     3D runtime using the *mesa-mesa3d* loader on Linux.
  92      ============== ============================================================
  93
  94   .. table:: AMDGPU Environments
  95      :name: amdgpu-environment-table
  96
  97      ============ ==============================================================
  98      Environment  Description
  99      ============ ==============================================================
 100      *<empty>*    Default.
 101      ============ ==============================================================
 102
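As an illustration only, the four tables above can be read as a small validity
check on the components of a target triple. The Python sketch below is not part
of LLVM or Clang: ``make_triple`` and the constant names are hypothetical, and
omitting trailing empty components when joining is an assumption rather than
documented behavior.

```python
# Hypothetical helper that mirrors the tables above; not an LLVM or Clang API.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}  # "" is the unknown OS
ENVIRONMENTS = {""}  # only the default (empty) environment is listed


def make_triple(arch, vendor="amd", os="", environment=""):
    """Return <Architecture>-<Vendor>-<OS>-<Environment> for listed values."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unlisted architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unlisted vendor: {vendor!r}")
    if os not in OPERATING_SYSTEMS:
        raise ValueError(f"unlisted OS: {os!r}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unlisted environment: {environment!r}")
    # The vendor table lists mesa3d only for the mesa3d OS.
    if vendor == "mesa3d" and os != "mesa3d":
        raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
    parts = [arch, vendor, os, environment]
    # Assumption: trailing empty components are omitted from the string.
    while parts and parts[-1] == "":
        parts.pop()
    return "-".join(parts)


print(make_triple("amdgcn", "amd", "amdhsa"))
```

Under these assumptions the call prints ``amdgcn-amd-amdhsa``, which is the
string that would follow the ``-target`` option.
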
 103 .. _amdgpu-processors:
 104
 105 Processors
 106 ----------
 107
 108 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
 109 specify the AMDGPU processor together with optional target features. See
 110 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
 111 target-specific information.
 112
 113 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
 114
 115 * ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
 116
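The processor option and the OS ABI rule above can be sketched together as
follows. This is a minimal illustration, not Clang's own logic:
``PROCESSOR_ARCH`` is a hand-copied subset of the processor table, and
``supports_os_abi`` and ``clang_flags`` are hypothetical names. The sketch
assumes the ``amd`` vendor and a non-empty OS, and it encodes only the
``r600``/``amdhsa`` exception, not the per-runtime *OS Support* column.

```python
# Hand-copied subset of the AMDGPU Processors table (processor -> architecture).
PROCESSOR_ARCH = {
    "r600": "r600",
    "rv770": "r600",
    "cedar": "r600",
    "cayman": "r600",
    "gfx600": "amdgcn",
    "gfx700": "amdgcn",
    "gfx803": "amdgcn",
    "gfx906": "amdgcn",
    "gfx1010": "amdgcn",
}


def supports_os_abi(processor, os):
    """Apply the one listed exception: amdhsa is not supported on r600."""
    return not (PROCESSOR_ARCH[processor] == "r600" and os == "amdhsa")


def clang_flags(processor, os):
    """Build -target and -mcpu arguments for a processor in the subset."""
    if not supports_os_abi(processor, os):
        raise ValueError(f"{processor} does not support the {os} OS ABI")
    # Assumption: the amd vendor, which the vendor table allows for all usage.
    triple = f"{PROCESSOR_ARCH[processor]}-amd-{os}"
    return ["-target", triple, f"-mcpu={processor}"]


print(clang_flags("gfx906", "amdhsa"))
```

For ``cayman`` with ``amdhsa`` the helper raises ``ValueError``, since
``cayman`` uses the ``r600`` architecture.
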
 117
 118   .. table:: AMDGPU Processors
 119      :name: amdgpu-processor-table
 120
 121      =========== =============== ============ ===== ================= =============== =============== ======================
 122      Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
 123                  Processor       Triple       APU   Features          Properties      *(see*          Products
 124                                  Architecture       Supported                         `amdgpu-os`_
 125                                                                                       *and
 126                                                                                       corresponding
 127                                                                                       runtime release
 128                                                                                       notes for
 129                                                                                       current
 130                                                                                       information and
 131                                                                                       level of
 132                                                                                       support)*
 133      =========== =============== ============ ===== ================= =============== =============== ======================
 134      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
 135      -----------------------------------------------------------------------------------------------------------------------
 136      ``r600``                    ``r600``     dGPU                    - Does not
 137                                                                         support
 138                                                                         generic
 139                                                                         address
 140                                                                         space
 141      ``r630``                    ``r600``     dGPU                    - Does not
 142                                                                         support
 143                                                                         generic
 144                                                                         address
 145                                                                         space
 146      ``rs880``                   ``r600``     dGPU                    - Does not
 147                                                                         support
 148                                                                         generic
 149                                                                         address
 150                                                                         space
 151      ``rv670``                   ``r600``     dGPU                    - Does not
 152                                                                         support
 153                                                                         generic
 154                                                                         address
 155                                                                         space
 156      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
 157      -----------------------------------------------------------------------------------------------------------------------
 158      ``rv710``                   ``r600``     dGPU                    - Does not
 159                                                                         support
 160                                                                         generic
 161                                                                         address
 162                                                                         space
 163      ``rv730``                   ``r600``     dGPU                    - Does not
 164                                                                         support
 165                                                                         generic
 166                                                                         address
 167                                                                         space
 168      ``rv770``                   ``r600``     dGPU                    - Does not
 169                                                                         support
 170                                                                         generic
 171                                                                         address
 172                                                                         space
 173      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
 174      -----------------------------------------------------------------------------------------------------------------------
 175      ``cedar``                   ``r600``     dGPU                    - Does not
 176                                                                         support
 177                                                                         generic
 178                                                                         address
 179                                                                         space
 180      ``cypress``                 ``r600``     dGPU                    - Does not
 181                                                                         support
 182                                                                         generic
 183                                                                         address
 184                                                                         space
 185      ``juniper``                 ``r600``     dGPU                    - Does not
 186                                                                         support
 187                                                                         generic
 188                                                                         address
 189                                                                         space
 190      ``redwood``                 ``r600``     dGPU                    - Does not
 191                                                                         support
 192                                                                         generic
 193                                                                         address
 194                                                                         space
 195      ``sumo``                    ``r600``     dGPU                    - Does not
 196                                                                         support
 197                                                                         generic
 198                                                                         address
 199                                                                         space
 200      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 201      -----------------------------------------------------------------------------------------------------------------------
 202      ``barts``                   ``r600``     dGPU                    - Does not
 203                                                                         support
 204                                                                         generic
 205                                                                         address
 206                                                                         space
 207      ``caicos``                  ``r600``     dGPU                    - Does not
 208                                                                         support
 209                                                                         generic
 210                                                                         address
 211                                                                         space
 212      ``cayman``                  ``r600``     dGPU                    - Does not
 213                                                                         support
 214                                                                         generic
 215                                                                         address
 216                                                                         space
 217      ``turks``                   ``r600``     dGPU                    - Does not
 218                                                                         support
 219                                                                         generic
 220                                                                         address
 221                                                                         space
 222      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 223      -----------------------------------------------------------------------------------------------------------------------
 224      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 225                                                                         support
 226                                                                         generic
 227                                                                         address
 228                                                                         space
 229      ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 230                  - ``verde``                                            support
 231                                                                         generic
 232                                                                         address
 233                                                                         space
 234      ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 235                  - ``oland``                                            support
 236                                                                         generic
 237                                                                         address
 238                                                                         space
 239      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 240      -----------------------------------------------------------------------------------------------------------------------
 241      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
 242                                                                         flat          - *pal-amdhsa*  - A6 Pro-7050B
 243                                                                         scratch       - *pal-amdpal*  - A8-7100
 244                                                                                                       - A8 Pro-7150B
 245                                                                                                       - A10-7300
 246                                                                                                       - A10 Pro-7350B
 247                                                                                                       - FX-7500
 248                                                                                                       - A8-7200P
 249                                                                                                       - A10-7400P
 250                                                                                                       - FX-7600P
 251      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
 252                                                                         flat          - *pal-amdhsa*  - FirePro W9100
 253                                                                         scratch       - *pal-amdpal*  - FirePro S9150
 254                                                                                                       - FirePro S9170
 255      ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
 256                                                                         flat          - *pal-amdhsa*  - Radeon R9 290X
 257                                                                         scratch       - *pal-amdpal*  - Radeon R9 390
 258                                                                                                       - Radeon R9 390X
 259      ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
 260                  - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
 261                                                                         scratch                       - E1-2500
 262                                                                                                       - E2-3000
 263                                                                                                       - E2-3800
 264                                                                                                       - A4-5000
 265                                                                                                       - A4-5100
 266                                                                                                       - A6-5200
 267                                                                                                       - A4 Pro-3340B
 268      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
 269                                                                         flat          - *pal-amdpal*  - Radeon HD 8770
 270                                                                         scratch                       - R7 260
 271                                                                                                       - R7 260X
 272      ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
 273                                                                         flat          - *pal-amdpal*
 274                                                                         scratch                       .. TODO::
 275
 276                                                                                                         Add product
 277                                                                                                         names.
 278
 279      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 280      -----------------------------------------------------------------------------------------------------------------------
 281      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
 282                                                                         flat          - *pal-amdhsa*  - Pro A6-8500B
 283                                                                         scratch       - *pal-amdpal*  - A8-8600P
 284                                                                                                       - Pro A8-8600B
 285                                                                                                       - FX-8800P
 286                                                                                                       - Pro A12-8800B
 287                                                                                                       - A10-8700P
 288                                                                                                       - Pro A10-8700B
 289                                                                                                       - A10-8780P
 290                                                                                                       - A10-9600P
 291                                                                                                       - A10-9630P
 292                                                                                                       - A12-9700P
 293                                                                                                       - A12-9730P
 294                                                                                                       - FX-9800P
 295                                                                                                       - FX-9830P
 296                                                                                                       - E2-9010
 297                                                                                                       - A6-9210
 298                                                                                                       - A9-9410
 299      ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
 300                  - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
 301                                                                         scratch       - *pal-amdpal*  - Radeon R9 385
 302      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
 303                                                                                       - *pal-amdhsa*  - Radeon R9 Fury
 304                                                                                       - *pal-amdpal*  - Radeon R9 Fury X
 305                                                                                                       - Radeon Pro Duo
 306                                                                                                       - FirePro S9300x2
 307                                                                                                       - Radeon Instinct MI8
 308      \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
 309                                                                         flat          - *pal-amdhsa*  - Radeon RX 480
 310                                                                         scratch       - *pal-amdpal*  - Radeon Instinct MI6
 311      \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
 312                                                                         flat          - *pal-amdhsa*
 313                                                                         scratch       - *pal-amdpal*
 314      ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
 315                                                                         flat          - *pal-amdhsa*  - FirePro S7100
 316                                                                         scratch       - *pal-amdpal*  - FirePro W7100
 317                                                                                                       - Mobile FirePro
 318                                                                                                         M7170
 319      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
 320                                                                         flat          - *pal-amdhsa*
 321                                                                         scratch       - *pal-amdpal*  .. TODO::
 322
 323                                                                                                         Add product
 324                                                                                                         names.
 325
 326      **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
 327      -----------------------------------------------------------------------------------------------------------------------
 328      ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
 329                                                                         flat          - *pal-amdhsa*    Frontier Edition
 330                                                                         scratch       - *pal-amdpal*  - Radeon RX Vega 56
 331                                                                                                       - Radeon RX Vega 64
 332                                                                                                       - Radeon RX Vega 64
 333                                                                                                         Liquid
 334                                                                                                       - Radeon Instinct MI25
 335      ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
 336                                                                         flat          - *pal-amdhsa*  - Ryzen 5 2400G
 337                                                                         scratch       - *pal-amdpal*
 338      ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
 339                                                                                       - *pal-amdhsa*
 340                                                                                       - *pal-amdpal*  .. TODO::
 341
 342                                                                                                         Add product
 343                                                                                                         names.
 344
 345      ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
 346                                                     - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
 347                                                                         scratch       - *pal-amdpal*  - Radeon VII
 348                                                                                                       - Radeon Pro VII
 349      ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
 350                                                     - xnack           - Absolute
 351                                                                         flat
 352                                                                         scratch
 353      ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
 354                                                                         flat
 355                                                                         scratch                       .. TODO::
 356
 357                                                                                                         Add product
 358                                                                                                         names.
 359
 360      ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
 361                                                     - tgsplit           flat
 362                                                     - xnack             scratch                       .. TODO::
 363                                                                       - Packed
 364                                                                         work-item                       Add product
 365                                                                         IDs                             names.
 366
 367      ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
 368                                                                         flat                          - Ryzen 7 4700GE
 369                                                                         scratch                       - Ryzen 5 4600G
 370                                                                                                       - Ryzen 5 4600GE
 371                                                                                                       - Ryzen 3 4300G
 372                                                                                                       - Ryzen 3 4300GE
 373                                                                                                       - Ryzen Pro 4000G
 374                                                                                                       - Ryzen 7 Pro 4700G
 375                                                                                                       - Ryzen 7 Pro 4750GE
 376                                                                                                       - Ryzen 5 Pro 4650G
 377                                                                                                       - Ryzen 5 Pro 4650GE
 378                                                                                                       - Ryzen 3 Pro 4350G
 379                                                                                                       - Ryzen 3 Pro 4350GE
 380
 381      ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
 382                                                     - tgsplit           flat
 383                                                     - xnack             scratch                       .. TODO::
 384                                                                       - Packed
 385                                                                         work-item                       Add product
 386                                                                         IDs                             names.
 387
 388      ``gfx941``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
 389                                                     - tgsplit           flat
 390                                                     - xnack             scratch                       .. TODO::
 391                                                                       - Packed
 392                                                                         work-item                       Add product
 393                                                                         IDs                             names.
 394
 395      ``gfx942``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
 396                                                     - tgsplit           flat
 397                                                     - xnack             scratch                       .. TODO::
 398                                                                       - Packed
 399                                                                         work-item                       Add product
 400                                                                         IDs                             names.
 401
 402      **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
 403      -----------------------------------------------------------------------------------------------------------------------
 404      ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
 405                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
 406                                                     - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
 407                                                                                                       - Radeon Pro 5600M
 408      ``gfx1011``                 ``amdgcn``   dGPU  - cumode                          - *rocm-amdhsa* - Radeon Pro V520
 409                                                     - wavefrontsize64 - Absolute      - *pal-amdhsa*
 410                                                     - xnack             flat          - *pal-amdpal*
 411                                                                         scratch
 412      ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
 413                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
 414                                                     - xnack             scratch       - *pal-amdpal*
 415      ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
 416                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 417                                                     - xnack             scratch       - *pal-amdpal*  .. TODO::
 418
 419                                                                                                         Add product
 420                                                                                                         names.
 421
 422      **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
 423      -----------------------------------------------------------------------------------------------------------------------
 424      ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
 425                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
 426                                                                         scratch       - *pal-amdpal*  - Radeon RX 6900 XT
 427      ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
 428                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 429                                                                         scratch       - *pal-amdpal*
 430      ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
 431                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 432                                                                         scratch       - *pal-amdpal*  .. TODO::
 433
 434                                                                                                         Add product
 435                                                                                                         names.
 436
 437      ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 438                                                     - wavefrontsize64   flat
 439                                                                         scratch                       .. TODO::
 440
 441                                                                                                         Add product
 442                                                                                                         names.
 443      ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
 444                                                     - wavefrontsize64   flat
 445                                                                         scratch                       .. TODO::
 446
 447                                                                                                         Add product
 448                                                                                                         names.
 449
 450      ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 451                                                     - wavefrontsize64   flat
 452                                                                         scratch                       .. TODO::
 453                                                                                                         Add product
 454                                                                                                         names.
 455
 456      ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 457                                                     - wavefrontsize64   flat
 458                                                                         scratch                       .. TODO::
 459
 460                                                                                                         Add product
 461                                                                                                         names.
 462
 463      **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
 464      -----------------------------------------------------------------------------------------------------------------------
 465      ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected   - *pal-amdpal*  *TBA*
 466                                                     - wavefrontsize64   flat
 467                                                                         scratch                       .. TODO::
 468                                                                       - Packed
 469                                                                         work-item                       Add product
 470                                                                         IDs                             names.
 471
 472      ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
 473                                                     - wavefrontsize64   flat
 474                                                                         scratch                       .. TODO::
 475                                                                       - Packed
 476                                                                         work-item                       Add product
 477                                                                         IDs                             names.
 478
 479      ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
 480                                                     - wavefrontsize64   flat
 481                                                                         scratch                       .. TODO::
 482                                                                       - Packed
 483                                                                         work-item                       Add product
 484                                                                         IDs                             names.
 485
 486      ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
 487                                                     - wavefrontsize64   flat
 488                                                                         scratch                       .. TODO::
 489                                                                       - Packed
 490                                                                         work-item                       Add product
 491                                                                         IDs                             names.
 492
 493      =========== =============== ============ ===== ================= =============== =============== ======================
 494
 495 .. _amdgpu-target-features:
 496
 497 Target Features
 498 ---------------
 499
 500 Target features control how code is generated to support certain
 501 processor-specific features. Not all target features are supported by
 502 all processors. The runtime must ensure that the features supported by
 503 the device used to execute the code match the features enabled when
 504 generating the code. A mismatch of features may result in incorrect
 505 execution, or a reduction in performance.
 506
 507 The target features supported by each processor are listed in
 508 :ref:`amdgpu-processor-table`.
 509
 510 Target features are controlled by exactly one of the following Clang
 511 options:
 512
 513 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
 514
 515   The ``-mcpu`` and ``--offload-arch`` options can specify target features as
 516   optional components of the target ID. If omitted, a target feature has the
 517   ``any`` value. See :ref:`amdgpu-target-id`.
 518
 519 ``-m[no-]<target-feature>``
 520
 521   Target features not specified by the target ID are specified using a
 522   separate option. These target features can have an ``on`` or ``off``
 523   value.  ``on`` is specified by omitting the ``no-`` prefix, and
 524   ``off`` is specified by including the ``no-`` prefix. The default
 525   if not specified is ``off``.
 526
 527 For example:
 528
 529 ``-mcpu=gfx908:xnack+``
 530   Enable the ``xnack`` feature.
 531 ``-mcpu=gfx908:xnack-``
 532   Disable the ``xnack`` feature.
 533 ``-mcumode``
 534   Enable the ``cumode`` feature.
 535 ``-mno-cumode``
 536   Disable the ``cumode`` feature.
 537
 538   .. table:: AMDGPU Target Features
 539      :name: amdgpu-target-features-table
 540
 541      =============== ============================ ==================================================
 542      Target Feature  Clang Option to Control      Description
 543      Name
 544      =============== ============================ ==================================================
 545      cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
 546                                                   when generating code for kernels. When disabled
 547                                                   native WGP wavefront execution mode is used,
 548                                                   when enabled CU wavefront execution mode is used
 549                                                   (see :ref:`amdgpu-amdhsa-memory-model`).
 550
 551      sramecc         - ``-mcpu``                  If specified, generate code that can only be
 552                      - ``--offload-arch``         loaded and executed in a process that has a
 553                                                   matching setting for SRAMECC.
 554
 555                                                   If not specified for code object V2 to V3, generate
 556                                                   code that can be loaded and executed in a process
 557                                                   with SRAMECC enabled.
 558
 559                                                   If not specified for code object V4 or above, generate
 560                                                   code that can be loaded and executed in a process
 561                                                   with either setting of SRAMECC.
 562
 563      tgsplit         - ``-m[no-]tgsplit``         Enable/disable generating code that assumes
 564                                                   work-groups are launched in threadgroup split mode.
 565                                                   When enabled the waves of a work-group may be
 566                                                   launched in different CUs.
 567
 568      wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
 569                                                   generating code for kernels. When disabled
 570                                                   native wavefront size 32 is used, when enabled
 571                                                   wavefront size 64 is used.
 572
 573      xnack           - ``-mcpu``                  If specified, generate code that can only be
 574                      - ``--offload-arch``         loaded and executed in a process that has a
 575                                                   matching setting for XNACK replay.
 576
 577                                                   If not specified for code object V2 to V3, generate
 578                                                   code that can be loaded and executed in a process
 579                                                   with XNACK replay enabled.
 580
 581                                                   If not specified for code object V4 or above, generate
 582                                                   code that can be loaded and executed in a process
 583                                                   with either setting of XNACK replay.
 584
 585                                                   XNACK replay can be used for demand paging and
 586                                                   page migration. If enabled in the device, then if
 587                                                   a page fault occurs the code may execute
 588                                                   incorrectly unless generated with XNACK replay
 589                                                   enabled, or generated for code object V4 or above without
 590                                                   specifying XNACK replay. Executing code that was
 591                                                   generated with XNACK replay enabled, or generated
 592                                                   for code object V4 or above without specifying XNACK replay,
 593                                                   on a device that does not have XNACK replay
 594                                                   enabled will execute correctly but may be less
 595                                                   performant than code generated for XNACK replay
 596                                                   disabled.
 597      =============== ============================ ==================================================
 598
 599 .. _amdgpu-target-id:
 600
 601 Target ID
 602 ---------
 603
 604 AMDGPU supports target IDs. See `Clang Offload Bundler
 605 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
 606 description. The AMDGPU target specific information is:
 607
 608 **processor**
 609   Is an AMDGPU processor or alternative processor name specified in
 610   :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
 611   the primary processor and alternative processor names. The canonical form
 612   target ID only allows the primary processor name.
 613
 614 **target-feature**
 615   Is a target feature name specified in :ref:`amdgpu-target-features-table` that
 616   is supported by the processor. The target features supported by each processor
 617   are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
 618   a target ID are marked as being controlled by ``-mcpu`` and
 619   ``--offload-arch``. Each target feature must appear at most once in a target
 620   ID. The non-canonical form target ID allows the target features to be
 621   specified in any order. The canonical form target ID requires the target
 622   features to be specified in alphabetic order.
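
As an illustration of the canonical form, the following is a minimal Python
sketch that parses a target ID and emits it with the target features in
alphabetic order. It assumes the ``<processor>(:<target-feature>(+|-))*``
spelling used by options such as ``-mcpu=gfx908:xnack+``. The function names
and the ``primary_names`` mapping (alternative processor name to primary
processor name) are hypothetical and are not part of LLVM.

.. code:: python

  def parse_target_id(target_id):
      """Split a target ID into (processor, {feature: "on" | "off"})."""
      processor, *parts = target_id.split(":")
      if not processor:
          raise ValueError("missing processor")
      features = {}
      for part in parts:
          if len(part) < 2 or part[-1] not in "+-":
              raise ValueError(f"malformed target feature: {part!r}")
          name = part[:-1]
          # Each target feature must appear at most once in a target ID.
          if name in features:
              raise ValueError(f"target feature specified twice: {name}")
          features[name] = "on" if part[-1] == "+" else "off"
      return processor, features


  def canonical_target_id(target_id, primary_names=None):
      """Return the target ID with a primary processor name and sorted features.

      primary_names is a hypothetical caller-supplied mapping from alternative
      processor names to primary processor names.
      """
      processor, features = parse_target_id(target_id)
      processor = (primary_names or {}).get(processor, processor)
      suffix = "".join(
          f":{name}{'+' if value == 'on' else '-'}"
          for name, value in sorted(features.items())
      )
      return processor + suffix

For example, ``canonical_target_id("gfx908:xnack+:sramecc-")`` returns
``gfx908:sramecc-:xnack+``. A feature that is absent from the returned
dictionary has the ``any`` value.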
 623
 624 .. _amdgpu-target-id-v2-v3:
 625
 626 Code Object V2 to V3 Target ID
 627 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 628
 629 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
 630 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
 631 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
 632 directive and the bundle entry ID. In those cases it has the following BNF
 633 syntax:
 634
 635 .. code::
 636
 637   <target-id> ::= <processor> ( "+" <target-feature> )*
 638
 639 Where a target feature is omitted if *Off* and present if *On* or *Any*.
 640
 641 .. note::
 642
 643   Code object V2 to V3 cannot represent *Any* and treats it the same as
 644   *On*.
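
The rule above can be sketched in Python as follows. This is an illustration
under stated assumptions, not the LLVM implementation: the function name is
hypothetical, the list of target features supported by the processor is
supplied by the caller (see :ref:`amdgpu-processor-table`), and the features
are emitted in the order the caller lists them because this section does not
specify an order.

.. code:: python

  def v2_v3_target_id(processor, supported_features, settings):
      """Build the code object V2 to V3 form of a target ID.

      settings maps a feature name to "on", "off" or "any". A feature that is
      missing from settings is treated as "any".
      """
      parts = [processor]
      for feature in supported_features:
          # A feature is omitted if Off and present if On or Any.
          if settings.get(feature, "any") != "off":
              parts.append("+" + feature)
      return "".join(parts)

With the illustrative feature list ``["sramecc", "xnack"]``, the settings
``{"xnack": "off"}`` give ``gfx908+sramecc``, while empty settings give
``gfx908+sramecc+xnack`` because *Any* is treated the same as *On*.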
 645
 646 .. _amdgpu-embedding-bundled-objects:
 647
 648 Embedding Bundled Code Objects
 649 ------------------------------
 650
 651 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
 652 as described in `Clang Offload Bundler
 653 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
 654
 655 .. note::
 656
 657   The target ID syntax used for code object V2 to V3 for a bundle entry ID
 658   differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
 659
 660 .. _amdgpu-address-spaces:
 661
 662 Address Spaces
 663 --------------
 664
 665 The AMDGPU architecture supports a number of memory address spaces. The address
 666 space names use the OpenCL standard names, with some additions.
 667
 668 The AMDGPU address spaces correspond to target architecture specific LLVM
 669 address space numbers used in LLVM IR.
 670
 671 The AMDGPU address spaces are described in
 672 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
 673 supported for the ``amdgcn`` target.
 674
 675   .. table:: AMDGPU Address Spaces
 676      :name: amdgpu-address-spaces-table
 677
 678      ================================= =============== =========== ================ ======= ============================
 679      ..                                                                                     64-Bit Process Address Space
 680      --------------------------------- --------------- ----------- ---------------- ------------------------------------
 681      Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
 682                                        Space Number    Name        Name             Size
 683      ================================= =============== =========== ================ ======= ============================
 684      Generic                           0               flat        flat             64      0x0000000000000000
 685      Global                            1               global      global           64      0x0000000000000000
 686      Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
 687      Local                             3               group       LDS              32      0xFFFFFFFF
 688      Constant                          4               constant    *same as global* 64      0x0000000000000000
 689      Private                           5               private     scratch          32      0xFFFFFFFF
 690      Constant 32-bit                   6               *TODO*                               0x00000000
 691      Buffer Fat Pointer (experimental) 7               *TODO*
 692      Buffer Resource (experimental)    8               *TODO*
 693      Streamout Registers               128             N/A         GS_REGS
 694      ================================= =============== =========== ================ ======= ============================
 695
 696 **Generic**
 697   The generic address space is supported unless the *Target Properties* column
 698   of :ref:`amdgpu-processor-table` specifies *Does not support generic address
 699   space*.
 700
 701   The generic address space uses the hardware flat address support for two fixed
 702   ranges of virtual addresses (the private and local apertures), that are
 703   outside the range of addressable global memory, to map from a flat address to
 704   a private or local address. This uses FLAT instructions that can take a flat
 705   address and access global, private (scratch), and group (LDS) memory depending
 706   on whether the address is within one of the aperture ranges.
 707
 708   Flat access to scratch requires hardware aperture setup and setup in the
 709   kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
 710   access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
 711   setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
 712
 713   To convert between a private or group address space address (termed a segment
 714   address) and a flat address the base address of the corresponding aperture
 715   can be used. For GFX7-GFX8 these are available in the
 716   :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
 717   Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
 718   GFX9-GFX11 the aperture base addresses are directly available as inline
 719   constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
 720   In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
 721   aligned to 2^32 which makes it easier to convert from flat to segment or
 722   segment to flat.
 723
 724   A global address space address has the same value when used as a flat address
 725   so no conversion is needed.
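
  The conversion between segment and flat addresses in 64-bit address mode can
  be sketched in Python as follows. This is an illustration only: the function
  names are hypothetical, the aperture base is passed in by the caller rather
  than read from the hardware, and the translation between the segment and
  flat NULL values (see :ref:`amdgpu-address-spaces-table`) is omitted.

  .. code:: python

    APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


    def is_in_aperture(aperture_base, flat_address):
        # The base is aligned to 2^32, so the upper 32 bits decide membership.
        return (flat_address >> 32) == (aperture_base >> 32)


    def segment_to_flat(aperture_base, segment_address):
        if aperture_base % APERTURE_SIZE != 0:
            raise ValueError("aperture base must be aligned to 2^32")
        if not 0 <= segment_address < APERTURE_SIZE:
            raise ValueError("segment address must fit in 32 bits")
        # The low 32 bits of an aligned base are zero, so OR acts as addition.
        return aperture_base | segment_address


    def flat_to_segment(aperture_base, flat_address):
        if not is_in_aperture(aperture_base, flat_address):
            raise ValueError("flat address is outside the aperture")
        # Keep only the low 32 bits, which hold the segment address.
        return flat_address & (APERTURE_SIZE - 1)

  Because of the alignment, both directions reduce to a mask or an OR of the
  low 32 bits, with no subtraction or range comparison against a limit.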
 726
 727 **Global and Constant**
 728   The global and constant address spaces both use global virtual addresses,
 729   which are the same virtual address space used by the CPU. However, some
 730   virtual addresses may only be accessible to the CPU, some only accessible
 731   by the GPU, and some by both.
 732
 733   Using the constant address space indicates that the data will not change
 734   during the execution of the kernel. This allows scalar read instructions to
 735   be used. As the constant address space can only be modified on the host
 736   side, a generic pointer loaded from the constant address space can safely be
 737   assumed to be a global pointer, since only the device global memory is visible
 738   and managed on the host side. The vector and scalar L1 caches are invalidated
 739   of volatile data before each kernel dispatch execution to allow constant
 740   memory to change values between kernel dispatches.
 741
 742 **Region**
 743   The region address space uses the hardware Global Data Store (GDS). All
 744   wavefronts executing on the same device will access the same memory for any
 745   given region address. However, the same region address accessed by wavefronts
 746   executing on different devices will access different memory. It is higher
 747   performance than global memory. It is allocated by the runtime. The data
 748   store (DS) instructions can be used to access it.
 749
 750 **Local**
 751   The local address space uses the hardware Local Data Store (LDS) which is
 752   automatically allocated when the hardware creates the wavefronts of a
 753   work-group, and freed when all the wavefronts of a work-group have
 754   terminated. All wavefronts belonging to the same work-group will access the
 755   same memory for any given local address. However, the same local address
 756   accessed by wavefronts belonging to different work-groups will access
 757   different memory. It is higher performance than global memory. The data store
 758   (DS) instructions can be used to access it.
 759
 760 **Private**
 761   The private address space uses the hardware scratch memory support which
 762   automatically allocates memory when it creates a wavefront and frees it when
 763   a wavefront terminates. The memory accessed by a lane of a wavefront for any
 764   given private address will be different to the memory accessed by another lane
 765   of the same or different wavefront for the same private address.
 766
 767   If a kernel dispatch uses scratch, then the hardware allocates memory from a
 768   pool of backing memory allocated by the runtime for each wavefront. The lanes
 769   of the wavefront access this using dword (4 byte) interleaving. The mapping
 770   used from private address to backing memory address is:
 771
 772     ``wavefront-scratch-base +
 773     ((private-address / 4) * wavefront-size * 4) +
 774     (wavefront-lane-id * 4) + (private-address % 4)``
 775
 776   If each lane of a wavefront accesses the same private address, the
 777   interleaving results in adjacent dwords being accessed and hence requires
 778   fewer cache lines to be fetched.
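
  The mapping can be checked with a small Python sketch. The function and
  parameter names are hypothetical, and the scratch base of 0 and wavefront
  size of 64 used in the example afterwards are chosen only for illustration.

  .. code:: python

    def private_to_backing(wavefront_scratch_base, private_address,
                           wavefront_lane_id, wavefront_size):
        """Map a private address of one lane to a backing memory address."""
        # Split the private address into a dword index and a byte within it.
        dword_index, byte_in_dword = divmod(private_address, 4)
        return (wavefront_scratch_base
                + dword_index * wavefront_size * 4  # one dword row per index
                + wavefront_lane_id * 4             # this lane's dword in the row
                + byte_in_dword)

  With a scratch base of 0 and a wavefront size of 64, private address 0 maps
  to backing addresses 0, 4, 8 and 12 for lanes 0 to 3, which are adjacent
  dwords. Private address 4 for lane 0 maps to 256, the start of the next row.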
 779
 780   There are different ways that the wavefront scratch base address is
 781   determined by a wavefront (see
 782   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 783
 784   Scratch memory can be accessed in an interleaved manner using buffer
 785   instructions with the scratch buffer descriptor and per wavefront scratch
 786   offset, by the scratch instructions, or by flat instructions. Multi-dword
 787   access is not supported except by flat and scratch instructions in
 788   GFX9-GFX11.
 789
 790 **Constant 32-bit**
 791   *TODO*
 792
 793 **Buffer Fat Pointer**
 794   The buffer fat pointer is an experimental address space that is currently
 795   unsupported in the backend. It exposes a non-integral pointer that is in
 796   the future intended to support the modelling of 128-bit buffer descriptors
 797   plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
 798   *pointer*), allowing normal LLVM load/store/atomic operations to be used to
 799   model the buffer descriptors used heavily in graphics workloads targeting
 800   the backend.
 801
 802   The buffer descriptor used to construct a buffer fat pointer must be *raw*:
 803   the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
 804   must be off, and the extent must be measured in bytes. (On subtargets where
 805   bounds checking may be disabled, buffer fat pointers may choose to enable
 806   it or not).
 807
 808 **Buffer Resource**
 809   The buffer resource is an experimental address space that is currently unsupported
 810   in the backend. It exposes a non-integral pointer that will represent a 128-bit
 811   buffer descriptor resource.
 812
 813   Since, in general, a buffer resource supports complex addressing modes that cannot
 814   be easily represented in LLVM (such as implicit swizzled access to structured
 815   buffers), it is **illegal** to perform non-trivial address computations, such as
 816   ``getelementptr`` operations, on buffer resources. They may be passed to
 817   AMDGPU buffer intrinsics, and they may be converted to and from ``i128``.
 818
 819   Casting a buffer resource to a buffer fat pointer is permitted and adds an offset
 820   of 0.
 821
 822 **Streamout Registers**
 823   Dedicated registers used by the GS NGG Streamout Instructions. The register
 824   file is modelled as a memory in a distinct address space because it is indexed
 825   by an address-like offset in place of named registers, and because register
 826   accesses affect LGKMcnt. This is an internal address space used only by the
 827   compiler. Do not use this address space for IR pointers.
 828
 829 .. _amdgpu-memory-scopes:
 830
 831 Memory Scopes
 832 -------------
 833
 834 This section describes the LLVM memory synchronization scopes supported by the
 835 AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
 836 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 837
 838 The memory model supported is based on the HSA memory model [HSA]_ which is
 839 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 840 relation is transitive over the synchronizes-with relation independent of scope
 841 and synchronizes-with allows the memory scope instances to be inclusive (see
 842 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 843
 844 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 845 inclusion and requires the memory scopes to exactly match. However, this
 846 is conservatively correct for OpenCL.
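
Scope inclusion can be illustrated with a short Python sketch. It encodes one
reading of table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`: two operations
can synchronize when both threads are in the same instance of the narrower of
the two sync scopes. The ``ThreadId`` type and the function names are
hypothetical, the IDs are assumed to be hierarchical (a work-group ID is only
unique within its agent, and so on), and address spaces, including the
``one-as`` variants, are not modelled.

.. code:: python

  from dataclasses import dataclass

  # Sync scopes ordered from narrowest to widest.
  SCOPE_ORDER = ["singlethread", "wavefront", "workgroup", "agent", "system"]


  @dataclass(frozen=True)
  class ThreadId:
      agent: int
      workgroup: int
      wavefront: int
      thread: int


  def scope_instance(thread, scope):
      """Return a key identifying the scope instance that contains the thread."""
      # Number of hierarchy levels needed to identify an instance of the scope.
      depth = len(SCOPE_ORDER) - 1 - SCOPE_ORDER.index(scope)
      path = (thread.agent, thread.workgroup, thread.wavefront, thread.thread)
      return path[:depth]


  def scopes_are_inclusive(thread_a, scope_a, thread_b, scope_b):
      """Check whether two operations are in a common instance of both scopes."""
      # An empty scope name stands for the default, which is system.
      scope_a = scope_a or "system"
      scope_b = scope_b or "system"
      narrower = min(scope_a, scope_b, key=SCOPE_ORDER.index)
      return scope_instance(thread_a, narrower) == scope_instance(thread_b, narrower)

For example, an ``agent`` operation and a ``workgroup`` operation executed by
two wavefronts of the same work-group satisfy the check, whereas a ``system``
operation and a ``wavefront`` operation executed by different wavefronts do
not.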
 847
 848   .. table:: AMDHSA LLVM Sync Scopes
 849      :name: amdgpu-amdhsa-llvm-sync-scopes-table
 850
 851      ======================= ===================================================
 852      LLVM Sync Scope         Description
 853      ======================= ===================================================
 854      *none*                  The default: ``system``.
 855
 856                              Synchronizes with, and participates in modification
 857                              and seq_cst total orderings with, other operations
 858                              (except image operations) for all address spaces
 859                              (except private, or generic that accesses private)
 860                              provided the other operation's sync scope is:
 861
 862                              - ``system``.
 863                              - ``agent`` and executed by a thread on the same
 864                                agent.
 865                              - ``workgroup`` and executed by a thread in the
 866                                same work-group.
 867                              - ``wavefront`` and executed by a thread in the
 868                                same wavefront.
 869
 870      ``agent``               Synchronizes with, and participates in modification
 871                              and seq_cst total orderings with, other operations
 872                              (except image operations) for all address spaces
 873                              (except private, or generic that accesses private)
 874                              provided the other operation's sync scope is:
 875
 876                              - ``system`` or ``agent`` and executed by a thread
 877                                on the same agent.
 878                              - ``workgroup`` and executed by a thread in the
 879                                same work-group.
 880                              - ``wavefront`` and executed by a thread in the
 881                                same wavefront.
 882
 883      ``workgroup``           Synchronizes with, and participates in modification
 884                              and seq_cst total orderings with, other operations
 885                              (except image operations) for all address spaces
 886                              (except private, or generic that accesses private)
 887                              provided the other operation's sync scope is:
 888
 889                              - ``system``, ``agent`` or ``workgroup`` and
 890                                executed by a thread in the same work-group.
 891                              - ``wavefront`` and executed by a thread in the
 892                                same wavefront.
 893
 894      ``wavefront``           Synchronizes with, and participates in modification
 895                              and seq_cst total orderings with, other operations
 896                              (except image operations) for all address spaces
 897                              (except private, or generic that accesses private)
 898                              provided the other operation's sync scope is:
 899
 900                              - ``system``, ``agent``, ``workgroup`` or
 901                                ``wavefront`` and executed by a thread in the
 902                                same wavefront.
 903
 904      ``singlethread``        Only synchronizes with, and participates in
 905                              modification and seq_cst total orderings with,
 906                              other operations (except image operations) running
 907                              in the same thread for all address spaces (for
 908                              example, in signal handlers).
 909
 910      ``one-as``              Same as ``system`` but only synchronizes with other
 911                              operations within the same address space.
 912
 913      ``agent-one-as``        Same as ``agent`` but only synchronizes with other
 914                              operations within the same address space.
 915
 916      ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
 917                              other operations within the same address space.
 918
 919      ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
 920                              other operations within the same address space.
 921
 922      ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
 923                              other operations within the same address space.
 924      ======================= ===================================================
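
The base sync scope rows above appear to follow one pattern: two operations
synchronize when their threads share an instance of the narrower of the two
scopes. The following C sketch models that reading. The type and function names
are hypothetical and are not LLVM APIs. It ignores the ``-one-as`` variants, the
image operation and private address space exclusions, and it assumes that
``singlethread`` paired with a wider scope requires the same thread, which the
table does not state explicitly.

```c
#include <stdbool.h>

/* Hypothetical model, not an LLVM API. Scopes are ordered from narrowest to
   widest so that the narrower of two scopes is the smaller enumerator. */
enum sync_scope {
  SCOPE_SINGLETHREAD,
  SCOPE_WAVEFRONT,
  SCOPE_WORKGROUP,
  SCOPE_AGENT,
  SCOPE_SYSTEM
};

/* Which execution units the two threads have in common. */
struct thread_relation {
  bool same_thread;
  bool same_wavefront;
  bool same_workgroup;
  bool same_agent;
};

/* True if both threads are in the same instance of scope s. */
bool share_scope(enum sync_scope s, struct thread_relation r) {
  switch (s) {
  case SCOPE_SINGLETHREAD: return r.same_thread;
  case SCOPE_WAVEFRONT:    return r.same_wavefront;
  case SCOPE_WORKGROUP:    return r.same_workgroup;
  case SCOPE_AGENT:        return r.same_agent;
  case SCOPE_SYSTEM:       return true; /* every thread is in the system */
  }
  return false;
}

/* Assumed reading of the table: operations with scopes a and b synchronize
   when their threads share an instance of the narrower scope. */
bool scopes_synchronize(enum sync_scope a, enum sync_scope b,
                        struct thread_relation r) {
  enum sync_scope narrower = a < b ? a : b;
  return share_scope(narrower, r);
}
```

For example, under this model a ``system`` operation and a ``workgroup``
operation synchronize only if both threads are in the same work-group.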
 925
 926 LLVM IR Intrinsics
 927 ------------------
 928
 929 The AMDGPU backend implements the following LLVM IR intrinsics.
 930
 931 *This section is WIP.*
 932
 933 .. TODO::
 934
 935    List AMDGPU intrinsics.
 936
 937 LLVM IR Attributes
 938 ------------------
 939
 940 The AMDGPU backend supports the following LLVM IR attributes.
 941
 942   .. table:: AMDGPU LLVM IR Attributes
 943      :name: amdgpu-llvm-ir-attributes-table
 944
 945      ======================================= ==========================================================
 946      LLVM Attribute                          Description
 947      ======================================= ==========================================================
 948      "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
 949                                              will be specified when the kernel is dispatched. Generated
 950                                              by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
 951                                              The implied default value is 1,1024.
 952
 953      "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
 954                                              argument block size for the implicit arguments. This
 955                                              varies by OS and language (for OpenCL see
 956                                              :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
 957      "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
 958                                              the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
 959      "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
 960                                              ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
 961      "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
 962                                              execution unit. Generated by the ``amdgpu_waves_per_eu``
 963                                              CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
 964                                              and the backend may not be able to satisfy the request. If
 965                                              the specified range is incompatible with the function's
 966                                              "amdgpu-flat-work-group-size" value, the occupancy bounds
 967                                              implied by the workgroup size take precedence.
 968
 969      "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
 970                                              mode register to be set on entry. Overrides the default for
 971                                              the calling convention.
 972      "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
 973                                              the mode register to be set on entry. Overrides the default
 974                                              for the calling convention.
 975
 976      "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
 977                                              llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
 978                                              attribute, or reached through a call site marked with this attribute,
 979                                              the value returned by the intrinsic is undefined. The backend can
 980                                              generally infer this during code generation, so typically there is no
 981                                              benefit to frontends marking functions with this.
 982
 983      "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
 984                                              llvm.amdgcn.workitem.id.y intrinsic.
 985
 986      "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
 987                                              llvm.amdgcn.workitem.id.z intrinsic.
 988
 989      "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
 990                                              llvm.amdgcn.workgroup.id.x intrinsic.
 991
 992      "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
 993                                              llvm.amdgcn.workgroup.id.y intrinsic.
 994
 995      "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
 996                                              llvm.amdgcn.workgroup.id.z intrinsic.
 997
 998      "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
 999                                              llvm.amdgcn.dispatch.ptr intrinsic.
1000
1001      "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
1002                                              llvm.amdgcn.implicitarg.ptr intrinsic.
1003
1004      "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
1005                                              llvm.amdgcn.dispatch.id intrinsic.
1006
1007      "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
1008                                              llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
1009                                              attributes, the queue pointer may be required in situations where the
1010                                              intrinsic call does not directly appear in the program. Some subtargets
1011                                              require the queue pointer to handle some addrspacecasts, as well
1012                                              as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
1013                                              llvm.debug intrinsics.
1014
1015      "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1016                                              kernel argument that holds the pointer to the hostcall buffer. If this
1017                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1018
1019      "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1020                                              kernel argument that holds the pointer to an initialized memory buffer
1021                                              that conforms to the requirements of the V1 implementation of the
1022                                              malloc/free device library. If this attribute is absent, then the
1023                                              amdgpu-no-implicitarg-ptr is also removed.
1024
1025      "amdgpu-no-multigrid-sync-arg"          Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1026                                              kernel argument that holds the multigrid synchronization pointer. If this
1027                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1028
1029      "amdgpu-no-default-queue"               Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1030                                              kernel argument that holds the default queue pointer. If this
1031                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1032
1033      "amdgpu-no-completion-action"           Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1034                                              kernel argument that holds the completion action pointer. If this
1035                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1036
1037      ======================================= ==========================================================
1038
1039 .. _amdgpu-elf-code-object:
1040
1041 ELF Code Object
1042 ===============
1043
1044 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1045 can be linked by ``lld`` to produce a standard ELF shared code object which can
1046 be loaded and executed on an AMDGPU target.
1047
1048 .. _amdgpu-elf-header:
1049
1050 Header
1051 ------
1052
1053 The AMDGPU backend uses the following ELF header:
1054
1055   .. table:: AMDGPU ELF Header
1056      :name: amdgpu-elf-header-table
1057
1058      ========================== ===============================
1059      Field                      Value
1060      ========================== ===============================
1061      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
1062      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
1063      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
1064                                 - ``ELFOSABI_AMDGPU_HSA``
1065                                 - ``ELFOSABI_AMDGPU_PAL``
1066                                 - ``ELFOSABI_AMDGPU_MESA3D``
1067      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1068                                 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1069                                 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1070                                 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1071                                 - ``ELFABIVERSION_AMDGPU_PAL``
1072                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
1073      ``e_type``                 - ``ET_REL``
1074                                 - ``ET_DYN``
1075      ``e_machine``              ``EM_AMDGPU``
1076      ``e_entry``                0
1077      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1078                                 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1079                                 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1080      ========================== ===============================
1081
1082 ..
1083
1084   .. table:: AMDGPU ELF Header Enumeration Values
1085      :name: amdgpu-elf-header-enumeration-values-table
1086
1087      =============================== =====
1088      Name                            Value
1089      =============================== =====
1090      ``EM_AMDGPU``                   224
1091      ``ELFOSABI_NONE``               0
1092      ``ELFOSABI_AMDGPU_HSA``         64
1093      ``ELFOSABI_AMDGPU_PAL``         65
1094      ``ELFOSABI_AMDGPU_MESA3D``      66
1095      ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1096      ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1097      ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1098      ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1099      ``ELFABIVERSION_AMDGPU_PAL``    0
1100      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1101      =============================== =====
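
As a rough illustration of how a tool might consume the values in the two tables
above, the following C sketch checks the identification bytes of an ``amdgcn``
HSA code object and maps ``e_ident[EI_ABIVERSION]`` to a code object version.
The function name is hypothetical. The byte offsets (4, 5, 7 and 8 within
``e_ident``, and 18 for ``e_machine``) come from the generic ELF layout rather
than from this document, and the constants are defined locally instead of being
taken from a system header.

```c
#include <stddef.h>
#include <stdint.h>

/* Values from the tables above, plus the standard ELF values for
   ELFCLASS64 and ELFDATA2LSB. Defined locally for the sketch. */
enum {
  ELFCLASS64 = 2,
  ELFDATA2LSB = 1,
  EM_AMDGPU = 224,
  ELFOSABI_AMDGPU_HSA = 64,
  ELFABIVERSION_AMDGPU_HSA_V2 = 0,
  ELFABIVERSION_AMDGPU_HSA_V3 = 1,
  ELFABIVERSION_AMDGPU_HSA_V4 = 2,
  ELFABIVERSION_AMDGPU_HSA_V5 = 3
};

/* Hypothetical helper: returns the HSA code object version (2 to 5), or 0 if
   the bytes are not an amdgcn HSA code object header with a known version. */
int amdgpu_hsa_code_object_version(const uint8_t *hdr, size_t len) {
  if (len < 20)
    return 0;
  /* ELF magic number. */
  if (hdr[0] != 0x7f || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
    return 0;
  /* e_ident[EI_CLASS] and e_ident[EI_DATA]. */
  if (hdr[4] != ELFCLASS64 || hdr[5] != ELFDATA2LSB)
    return 0;
  /* e_machine is a little-endian 16-bit field at offset 18. */
  uint16_t machine = (uint16_t)(hdr[18] | (hdr[19] << 8));
  if (machine != EM_AMDGPU)
    return 0;
  /* e_ident[EI_OSABI]. */
  if (hdr[7] != ELFOSABI_AMDGPU_HSA)
    return 0;
  /* e_ident[EI_ABIVERSION]. */
  switch (hdr[8]) {
  case ELFABIVERSION_AMDGPU_HSA_V2: return 2;
  case ELFABIVERSION_AMDGPU_HSA_V3: return 3;
  case ELFABIVERSION_AMDGPU_HSA_V4: return 4;
  case ELFABIVERSION_AMDGPU_HSA_V5: return 5;
  default: return 0;
  }
}
```

The sketch rejects PAL and Mesa 3D code objects because their ABI version
values overlap with ``ELFABIVERSION_AMDGPU_HSA_V2``, so the OS ABI must be
checked before the ABI version is interpreted.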
1102
1103 ``e_ident[EI_CLASS]``
1104   The ELF class is:
1105
1106   * ``ELFCLASS32`` for ``r600`` architecture.
1107
1108   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1109     process address space applications.
1110
1111 ``e_ident[EI_DATA]``
1112   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1113
1114 ``e_ident[EI_OSABI]``
1115   One of the following AMDGPU target architecture specific OS ABIs
1116   (see :ref:`amdgpu-os`):
1117
1118   * ``ELFOSABI_NONE`` for *unknown* OS.
1119
1120   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1121
1122   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1123
1124   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1125
1126 ``e_ident[EI_ABIVERSION]``
1127   The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1128   object conforms:
1129
1130   * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1131     runtime ABI for code object V2. Specify using the Clang option
1132     ``-mcode-object-version=2``.
1133
1134   * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1135     runtime ABI for code object V3. Specify using the Clang option
1136     ``-mcode-object-version=3``.
1137
1138   * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1139     runtime ABI for code object V4. Specify using the Clang option
1140     ``-mcode-object-version=4``. This is the default code object
1141     version if not specified.
1142
1143   * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1144     runtime ABI for code object V5. Specify using the Clang option
1145     ``-mcode-object-version=5``.
1146
1147   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1148     runtime ABI.
1149
1150   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1151     3D runtime ABI.
1152
1153 ``e_type``
1154   Can be one of the following values:
1155
1156
1157   ``ET_REL``
1158     The type produced by the AMDGPU backend compiler as it is a relocatable
1159     code object.
1160
1161   ``ET_DYN``
1162     The type produced by the linker as it is a shared code object.
1163
1164   The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1165
1166 ``e_machine``
1167   The value ``EM_AMDGPU`` is used for the machine for all processors supported
1168   by the ``r600`` and ``amdgcn`` architectures (see
1169   :ref:`amdgpu-processor-table`). The specific processor is specified in the
1170   ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1171   :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1172   ``e_flags`` for code object V3 and above (see
1173   :ref:`amdgpu-elf-header-e_flags-table-v3` and
1174   :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1175
1176 ``e_entry``
1177   The entry point is 0 as the entry points for individual kernels must be
1178   selected in order to invoke them through AQL packets.
1179
1180 ``e_flags``
1181   The AMDGPU backend uses the following ELF header flags:
1182
1183   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1184      :name: amdgpu-elf-header-e_flags-v2-table
1185
1186      ===================================== ===== =============================
1187      Name                                  Value Description
1188      ===================================== ===== =============================
1189      ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1190                                                  target feature is
1191                                                  enabled for all code
1192                                                  contained in the code object.
1193                                                  If the processor
1194                                                  does not support the
1195                                                  ``xnack`` target
1196                                                  feature then must
1197                                                  be 0.
1198                                                  See
1199                                                  :ref:`amdgpu-target-features`.
1200      ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1201                                                  handler is enabled for all
1202                                                  code contained in the code
1203                                                  object. If the processor
1204                                                  does not support a trap
1205                                                  handler then must be 0.
1206                                                  See
1207                                                  :ref:`amdgpu-target-features`.
1208      ===================================== ===== =============================
1209
1210   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1211      :name: amdgpu-elf-header-e_flags-table-v3
1212
1213      ================================= ===== =============================
1214      Name                              Value Description
1215      ================================= ===== =============================
1216      ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1217                                              mask for
1218                                              ``EF_AMDGPU_MACH_xxx`` values
1219                                              defined in
1220                                              :ref:`amdgpu-ef-amdgpu-mach-table`.
1221      ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1222                                              target feature is
1223                                              enabled for all code
1224                                              contained in the code object.
1225                                              If the processor
1226                                              does not support the
1227                                              ``xnack`` target
1228                                              feature then must
1229                                              be 0.
1230                                              See
1231                                              :ref:`amdgpu-target-features`.
1232      ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1233                                              target feature is
1234                                              enabled for all code
1235                                              contained in the code object.
1236                                              If the processor
1237                                              does not support the
1238                                              ``sramecc`` target
1239                                              feature then must
1240                                              be 0.
1241                                              See
1242                                              :ref:`amdgpu-target-features`.
1243      ================================= ===== =============================
1244
1245   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1246      :name: amdgpu-elf-header-e_flags-table-v4-onwards
1247
1248      ============================================ ===== ===================================
1249      Name                                         Value Description
1250      ============================================ ===== ===================================
1251      ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1252                                                         mask for
1253                                                         ``EF_AMDGPU_MACH_xxx`` values
1254                                                         defined in
1255                                                         :ref:`amdgpu-ef-amdgpu-mach-table`.
1256      ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1257                                                         ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1258                                                         values.
1259      ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1260      ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1261      ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1262      ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1263      ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1264                                                         ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1265                                                         values.
1266      ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1267      ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
1268      ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1269      ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1270      ============================================ ===== ===================================
1271
1272   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1273      :name: amdgpu-ef-amdgpu-mach-table
1274
1275      ==================================== ========== =============================
1276      Name                                 Value      Description (see
1277                                                      :ref:`amdgpu-processor-table`)
1278      ==================================== ========== =============================
1279      ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1280      ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1281      ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1282      ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1283      ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1284      ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1285      ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1286      ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1287      ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1288      ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1289      ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1290      ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1291      ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1292      ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1293      ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1294      ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1295      ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1296      *reserved*                           0x011 -    Reserved for ``r600``
1297                                           0x01f      architecture processors.
1298      ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1299      ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1300      ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1301      ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1302      ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1303      ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1304      ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1305      *reserved*                           0x027      Reserved.
1306      ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1307      ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1308      ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1309      ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1310      ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1311      ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1312      ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1313      ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1314      ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1315      ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1316      ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1317      ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1318      ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1319      ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1320      ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1321      ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1322      ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1323      ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1324      ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1325      ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1326      ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1327      ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1328      ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1329      ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1330      ``EF_AMDGPU_MACH_AMDGCN_GFX940``     0x040      ``gfx940``
1331      ``EF_AMDGPU_MACH_AMDGCN_GFX1100``    0x041      ``gfx1100``
1332      ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1333      *reserved*                           0x043      Reserved.
1334      ``EF_AMDGPU_MACH_AMDGCN_GFX1103``    0x044      ``gfx1103``
1335      ``EF_AMDGPU_MACH_AMDGCN_GFX1036``    0x045      ``gfx1036``
1336      ``EF_AMDGPU_MACH_AMDGCN_GFX1101``    0x046      ``gfx1101``
1337      ``EF_AMDGPU_MACH_AMDGCN_GFX1102``    0x047      ``gfx1102``
1338      *reserved*                           0x048      Reserved.
1339      *reserved*                           0x049      Reserved.
1340      *reserved*                           0x04a      Reserved.
1341      ``EF_AMDGPU_MACH_AMDGCN_GFX941``     0x04b      ``gfx941``
1342      ``EF_AMDGPU_MACH_AMDGCN_GFX942``     0x04c      ``gfx942``
1343      ==================================== ========== =============================
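
The masks in the ``e_flags`` tables above can be applied as in the following C
sketch. The mask names and values are taken from the tables. The
``feature_setting`` enumeration and the function names are hypothetical and
exist only for this illustration. The sketch relies on the V4 selection values
being consecutive within each mask, so that shifting the masked field yields
0 to 3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask values from the e_flags tables above. */
enum {
  EF_AMDGPU_MACH = 0x0ff,
  EF_AMDGPU_FEATURE_XNACK_V3 = 0x100,
  EF_AMDGPU_FEATURE_SRAMECC_V3 = 0x200,
  EF_AMDGPU_FEATURE_XNACK_V4 = 0x300,
  EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xc00
};

/* Hypothetical decoded form of a V4 feature selection field. */
enum feature_setting {
  FEATURE_UNSUPPORTED = 0,
  FEATURE_ANY = 1,
  FEATURE_OFF = 2,
  FEATURE_ON = 3
};

/* Processor selection: one of the EF_AMDGPU_MACH_xxx values (V3 onwards). */
unsigned amdgpu_mach(uint32_t e_flags) {
  return e_flags & EF_AMDGPU_MACH;
}

/* Code object V3: each feature is a single bit. */
bool amdgpu_xnack_v3(uint32_t e_flags) {
  return (e_flags & EF_AMDGPU_FEATURE_XNACK_V3) != 0;
}

bool amdgpu_sramecc_v3(uint32_t e_flags) {
  return (e_flags & EF_AMDGPU_FEATURE_SRAMECC_V3) != 0;
}

/* Code object V4 onwards: each feature is a two-bit selection field. */
enum feature_setting amdgpu_xnack_v4(uint32_t e_flags) {
  return (enum feature_setting)((e_flags & EF_AMDGPU_FEATURE_XNACK_V4) >> 8);
}

enum feature_setting amdgpu_sramecc_v4(uint32_t e_flags) {
  return (enum feature_setting)((e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) >> 10);
}
```

For instance, a V4 ``e_flags`` value of ``0x73f`` decodes to
``EF_AMDGPU_MACH_AMDGCN_GFX90A`` with XNACK enabled and SRAMECC able to have
any value. The same bits must not be decoded with the V3 helpers, because the
code object version determines which table applies.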
1344
1345 Sections
1346 --------
1347
1348 An AMDGPU target ELF code object has the standard ELF sections which include:
1349
1350   .. table:: AMDGPU ELF Sections
1351      :name: amdgpu-elf-sections-table
1352
1353      ================== ================ =================================
1354      Name               Type             Attributes
1355      ================== ================ =================================
1356      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1357      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1358      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1359      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
1360      ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
1361      ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
1362      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1363      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1364      ``.note``          ``SHT_NOTE``     *none*
1365      ``.rela``\ *name*  ``SHT_RELA``     *none*
1366      ``.rela.dyn``      ``SHT_RELA``     *none*
1367      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1368      ``.shstrtab``      ``SHT_STRTAB``   *none*
1369      ``.strtab``        ``SHT_STRTAB``   *none*
1370      ``.symtab``        ``SHT_SYMTAB``   *none*
1371      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1372      ================== ================ =================================
1373
1374 These sections have their standard meanings (see [ELF]_) and are only generated
1375 if needed.
1376
1377 ``.debug_``\ *\**
1378   The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1379   information on the DWARF produced by the AMDGPU backend.
1380
1381 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1382   The standard sections used by a dynamic loader.
1383
1384 ``.note``
1385   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1386   backend.
1387
1388 ``.rela``\ *name*, ``.rela.dyn``
1389   For relocatable code objects, *name* is the name of the section to which the
1390   relocation records apply. For example, ``.rela.text`` is the section name for
1391   relocation records associated with the ``.text`` section.
1392
1393   For linked shared code objects, ``.rela.dyn`` contains all the relocation
1394   records from each of the relocatable code objects' ``.rela``\ *name* sections.
1395
1396   See :ref:`amdgpu-relocation-records` for the relocation records supported by
1397   the AMDGPU backend.
1398
1399 ``.text``
1400   The executable machine code for the kernels and functions they call. Generated
1401   as position independent code. See :ref:`amdgpu-code-conventions` for
1402   information on conventions used in the ISA generation.
1403
1404 .. _amdgpu-note-records:
1405
1406 Note Records
1407 ------------
1408
1409 The AMDGPU backend code object contains ELF note records in the ``.note``
1410 section. The set of generated notes and their semantics depend on the code
1411 object version; see :ref:`amdgpu-note-records-v2` and
1412 :ref:`amdgpu-note-records-v3-onwards`.
1413
1414 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1415 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1416 byte aligned. In addition, minimal zero-byte padding must be generated to
1417 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1418 field of the ``.note`` section must be at least 4 to indicate at least 4 byte
1419 alignment.
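
The padding rules above can be expressed as simple arithmetic. The following C
sketch assumes the generic ELF note layout of three 4-byte words (``namesz``,
``descsz`` and ``type``) followed by the ``name`` and ``desc`` fields. That
layout comes from the ELF specification rather than from this document, and the
function names are hypothetical.

```c
#include <stdint.h>

/* Round n up to the next multiple of 4. */
uint32_t align4(uint32_t n) {
  return (n + 3u) & ~3u;
}

/* Offset of the desc field from the start of a note record: the 12-byte
   note header followed by the name field padded to a 4-byte boundary. */
uint32_t note_desc_offset(uint32_t namesz) {
  return 12u + align4(namesz);
}

/* Total size of a note record, including the padding after desc. */
uint32_t note_record_size(uint32_t namesz, uint32_t descsz) {
  return note_desc_offset(namesz) + align4(descsz);
}
```

With a vendor name of "AMD", ``namesz`` is 4 once the terminating null byte is
counted, so no padding is needed after the name and ``desc`` starts at offset
16. A 7-byte name would receive one byte of padding, placing ``desc`` at offset
20.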
1420
1421 .. _amdgpu-note-records-v2:
1422
1423 Code Object V2 Note Records
1424 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1425
1426 .. warning::
1427   Code object V2 is not the default code object version emitted by
1428   this version of LLVM.
1429
1430 The AMDGPU backend code object uses the following ELF note record in the
1431 ``.note`` section when compiling for code object V2.
1432
1433 The note record vendor field is "AMD".
1434
1435 Additional note records may be present, but any which are not documented here
1436 are deprecated and should not be used.
1437
1438   .. table:: AMDGPU Code Object V2 ELF Note Records
1439      :name: amdgpu-elf-note-records-v2-table
1440
1441      ===== ===================================== ======================================
1442      Name  Type                                  Description
1443      ===== ===================================== ======================================
1444      "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1445      "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1446                                                  Finalizer and not the LLVM compiler.
1447      "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1448      "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1449                                                  YAML [YAML]_ textual format.
1450      "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1451      ===== ===================================== ======================================
1452
1453 ..
1454
1455   .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1456      :name: amdgpu-elf-note-record-enumeration-values-v2-table
1457
1458      ===================================== =====
1459      Name                                  Value
1460      ===================================== =====
1461      ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1462      ``NT_AMD_HSA_HSAIL``                  2
1463      ``NT_AMD_HSA_ISA_VERSION``            3
1464      *reserved*                            4-9
1465      ``NT_AMD_HSA_METADATA``               10
1466      ``NT_AMD_HSA_ISA_NAME``               11
1467      ===================================== =====
1468
1469 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1470   Specifies the code object version number. The description field has the
1471   following layout:
1472
1473   .. code:: c
1474
1475     struct amdgpu_hsa_note_code_object_version_s {
1476       uint32_t major_version;
1477       uint32_t minor_version;
1478     };
1479
1480   The ``major_version`` has a value less than or equal to 2.
1481
1482 ``NT_AMD_HSA_HSAIL``
1483   Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1484   field has the following layout:
1485
1486   .. code:: c
1487
1488     struct amdgpu_hsa_note_hsail_s {
1489       uint32_t hsail_major_version;
1490       uint32_t hsail_minor_version;
1491       uint8_t profile;
1492       uint8_t machine_model;
1493       uint8_t default_float_round;
1494     };
1495
1496 ``NT_AMD_HSA_ISA_VERSION``
1497   Specifies the target ISA version. The description field has the following layout:
1498
1499   .. code:: c
1500
1501     struct amdgpu_hsa_note_isa_s {
1502       uint16_t vendor_name_size;
1503       uint16_t architecture_name_size;
1504       uint32_t major;
1505       uint32_t minor;
1506       uint32_t stepping;
1507       char vendor_and_architecture_name[1];
1508     };
1509
1510   ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1511   vendor and architecture names respectively, including the NUL character.
1512
1513   ``vendor_and_architecture_name`` contains the NUL terminated string for the
1514   vendor, immediately followed by the NUL terminated string for the
1515   architecture.
1516
1517   This note record is used by the HSA runtime loader.
1518
1519   Code object V2 only supports a limited number of processors and has fixed
1520   settings for target features. See
1521   :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1522   processors and the corresponding target ID. In the table the note record ISA
1523   name is a concatenation of the vendor name, architecture name, major, minor,
1524   and stepping separated by a ":".
1525
1526   The target ID column shows the processor name and fixed target features used
1527   by the LLVM compiler. The LLVM compiler does not generate a
1528   ``NT_AMD_HSA_HSAIL`` note record.
1529
1530   A code object generated by the Finalizer also uses code object V2 and always
1531   generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1532   ``sramecc`` target feature are as shown in
1533   :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1534   target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1535   bit.
1536
1537 ``NT_AMD_HSA_ISA_NAME``
1538   Specifies the target ISA name as a non-NUL terminated string.
1539
1540   This note record is not used by the HSA runtime loader.
1541
1542   See the ``NT_AMD_HSA_ISA_VERSION`` note record description for details of
1543   code object V2's limited processor support and fixed target feature settings.
1544
1545   See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1546   from the string to the corresponding target ID. If the ``xnack`` target
1547   feature is supported and enabled, the string produced by the LLVM compiler
1548   may have ``+xnack`` appended. The Finalizer did not append it and instead
1549   used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1550
1551 ``NT_AMD_HSA_METADATA``
1552   Specifies extensible metadata associated with the code objects executed on HSA
1553   [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1554   target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1555   :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1556   metadata string.
1557
1558   .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1559      :name: amdgpu-elf-note-record-supported_processors-v2-table
1560
1561      ===================== ==========================
1562      Note Record ISA Name  Target ID
1563      ===================== ==========================
1564      ``AMD:AMDGPU:6:0:0``  ``gfx600``
1565      ``AMD:AMDGPU:6:0:1``  ``gfx601``
1566      ``AMD:AMDGPU:6:0:2``  ``gfx602``
1567      ``AMD:AMDGPU:7:0:0``  ``gfx700``
1568      ``AMD:AMDGPU:7:0:1``  ``gfx701``
1569      ``AMD:AMDGPU:7:0:2``  ``gfx702``
1570      ``AMD:AMDGPU:7:0:3``  ``gfx703``
1571      ``AMD:AMDGPU:7:0:4``  ``gfx704``
1572      ``AMD:AMDGPU:7:0:5``  ``gfx705``
1573      ``AMD:AMDGPU:8:0:0``  ``gfx802``
1574      ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1575      ``AMD:AMDGPU:8:0:2``  ``gfx802``
1576      ``AMD:AMDGPU:8:0:3``  ``gfx803``
1577      ``AMD:AMDGPU:8:0:4``  ``gfx803``
1578      ``AMD:AMDGPU:8:0:5``  ``gfx805``
1579      ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1580      ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1581      ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1582      ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1583      ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1584      ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1585      ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1586      ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1587      ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1588      ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1589      ===================== ==========================
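
As an illustration of how the note record ISA name in the table above relates
to the ``NT_AMD_HSA_ISA_VERSION`` description field, the following sketch joins
the vendor name, architecture name, major, minor, and stepping with ":". The
names ``isa_note_fields`` and ``format_isa_name`` are hypothetical, and the
sketch assumes the caller has already located the fixed-size fields and the
start of ``vendor_and_architecture_name`` within the note.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fixed-size leading fields of the description (hypothetical mirror of
   amdgpu_hsa_note_isa_s without the trailing character array). */
struct isa_note_fields {
  uint16_t vendor_name_size;       /* includes the NUL */
  uint16_t architecture_name_size; /* includes the NUL */
  uint32_t major;
  uint32_t minor;
  uint32_t stepping;
};

/* Write "<vendor>:<arch>:<major>:<minor>:<stepping>" to out.
   names points at vendor_and_architecture_name. Returns the snprintf
   result, or -1 if either string is empty or not NUL terminated. */
static int format_isa_name(const struct isa_note_fields *f, const char *names,
                           char *out, size_t out_size) {
  const char *vendor = names;
  const char *arch;
  if (f->vendor_name_size == 0 || f->architecture_name_size == 0)
    return -1;
  /* The architecture string immediately follows the vendor string. */
  arch = names + f->vendor_name_size;
  if (vendor[f->vendor_name_size - 1] != '\0' ||
      arch[f->architecture_name_size - 1] != '\0')
    return -1;
  return snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch,
                  (unsigned)f->major, (unsigned)f->minor,
                  (unsigned)f->stepping);
}
```

With vendor "AMD", architecture "AMDGPU", and version 9, 0, 6, this produces
``AMD:AMDGPU:9:0:6``, which the table maps to ``gfx906:sramecc-:xnack-``.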
1590
1591 .. _amdgpu-note-records-v3-onwards:
1592
1593 Code Object V3 and Above Note Records
1594 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1595
1596 The AMDGPU backend code object uses the following ELF note record in the
1597 ``.note`` section when compiling for code object V3 and above.
1598
1599 The note record vendor field is "AMDGPU".
1600
1601 Additional note records may be present, but any which are not documented here
1602 are deprecated and should not be used.
1603
1604   .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1605      :name: amdgpu-elf-note-records-table-v3-onwards
1606
1607      ======== ============================== ======================================
1608      Name     Type                           Description
1609      ======== ============================== ======================================
1610      "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1611                                              binary format.
1612      ======== ============================== ======================================
1613
1614 ..
1615
1616   .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1617      :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1618
1619      ============================== =====
1620      Name                           Value
1621      ============================== =====
1622      *reserved*                     0-31
1623      ``NT_AMDGPU_METADATA``         32
1624      ============================== =====
1625
1626 ``NT_AMDGPU_METADATA``
1627   Specifies extensible metadata associated with an AMDGPU code object. It is
1628   encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1629   :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1630   :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
1631   :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
1632   ``amdhsa`` OS.
1633
1634 .. _amdgpu-symbols:
1635
1636 Symbols
1637 -------
1638
1639 Symbols include the following:
1640
1641   .. table:: AMDGPU ELF Symbols
1642      :name: amdgpu-elf-symbols-table
1643
1644      ===================== ================== ================ ==================
1645      Name                  Type               Section          Description
1646      ===================== ================== ================ ==================
1647      *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1648                                               - ``.rodata``
1649                                               - ``.bss``
1650      *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1651      *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1652      *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1653      ===================== ================== ================ ==================
1654
1655 Global variable
1656   Global variables both used and defined by the compilation unit.
1657
1658   If the symbol is defined in the compilation unit then it is allocated in the
1659   section appropriate to whether it has initialized data or is read-only.
1660
1661   If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1662   will resolve relocations using the definition provided by another code object
1663   or explicitly defined by the runtime.
1664
1665   If the symbol resides in local/group memory (LDS) then its section is the
1666   special processor-specific section index ``SHN_AMDGPU_LDS``, and the
1667   ``st_value`` field describes alignment requirements as it does for common
1668   symbols.
1669
1670   .. TODO::
1671
1672      Add description of linked shared object symbols. Seems undefined symbols
1673      are marked as STT_NOTYPE.
1674
1675 Kernel descriptor
1676   Every HSA kernel has an associated kernel descriptor. It is the address of the
1677   kernel descriptor that is used in the AQL dispatch packet used to invoke the
1678   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1679   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1680
1681 Kernel entry point
1682   Every HSA kernel also has a symbol for its machine code entry point.
1683
1684 .. _amdgpu-relocation-records:
1685
1686 Relocation Records
1687 ------------------
1688
1689 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1690 relocatable fields are:
1691
1692 ``word32``
1693   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1694   alignment. These values use the same byte order as other word values in the
1695   AMDGPU architecture.
1696
1697 ``word64``
1698   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1699   alignment. These values use the same byte order as other word values in the
1700   AMDGPU architecture.
1701
1702 The following notation is used for specifying relocation calculations:
1703
1704 **A**
1705   Represents the addend used to compute the value of the relocatable field.
1706
1707 **G**
1708   Represents the offset into the global offset table at which the relocation
1709   entry's symbol will reside during execution.
1710
1711 **GOT**
1712   Represents the address of the global offset table.
1713
1714 **P**
1715   Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1716   of the storage unit being relocated (computed using ``r_offset``).
1717
1718 **S**
1719   Represents the value of the symbol whose index resides in the relocation
1720   entry. Relocations not using this must specify a symbol index of
1721   ``STN_UNDEF``.
1722
1723 **B**
1724   Represents the base address of a loaded executable or shared object which is
1725   the difference between the ELF address and the actual load address.
1726   Relocations using this are only valid in executable or shared objects.
1727
1728 The following relocation types are supported:
1729
1730   .. table:: AMDGPU ELF Relocation Records
1731      :name: amdgpu-elf-relocation-records-table
1732
1733      ========================== ======= =====  ==========  ==============================
1734      Relocation Type            Kind    Value  Field       Calculation
1735      ========================== ======= =====  ==========  ==============================
1736      ``R_AMDGPU_NONE``                  0      *none*      *none*
1737      ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1738                                 Dynamic
1739      ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1740                                 Dynamic
1741      ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1742                                 Dynamic
1743      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1744      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1745      ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1746                                 Dynamic
1747      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1748      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1749      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1750      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1751      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1752      *reserved*                         12
1753      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1754      ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1755      ========================== ======= =====  ==========  ==============================
1756
1757 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1758 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1759
1760 There is no current OS loader support for 32-bit programs and so
1761 ``R_AMDGPU_ABS32`` is not used.
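
The calculations in the table can be illustrated with a short C sketch covering
a subset of the relocation types. The function name ``amdgpu_reloc_value`` is
hypothetical and the enumerators simply repeat the values from the table; this
is not how LLVM or any loader is known to implement relocation, and it
deliberately omits the GOT-based and 16-bit types.

```c
#include <stdint.h>

/* Relocation type values copied from the table above. */
enum {
  R_AMDGPU_ABS32_LO = 1,
  R_AMDGPU_ABS32_HI = 2,
  R_AMDGPU_ABS64 = 3,
  R_AMDGPU_REL64 = 5,
  R_AMDGPU_REL32_LO = 10,
  R_AMDGPU_REL32_HI = 11
};

/* Compute the value to store in the relocated field.
   s, a and p are S, A and P from the notation above.
   Returns 0 on success, -1 for a type this sketch does not handle. */
static int amdgpu_reloc_value(unsigned type, uint64_t s, int64_t a,
                              uint64_t p, uint64_t *out) {
  uint64_t sa = s + (uint64_t)a; /* S + A, modulo 2^64 */
  uint64_t sap = sa - p;         /* S + A - P, modulo 2^64 */
  switch (type) {
  case R_AMDGPU_ABS32_LO: *out = sa & 0xFFFFFFFFu; return 0;
  case R_AMDGPU_ABS32_HI: *out = sa >> 32; return 0;
  case R_AMDGPU_ABS64:    *out = sa; return 0;
  case R_AMDGPU_REL64:    *out = sap; return 0;
  case R_AMDGPU_REL32_LO: *out = sap & 0xFFFFFFFFu; return 0;
  case R_AMDGPU_REL32_HI: *out = sap >> 32; return 0;
  default: return -1;
  }
}
```

The ``_LO`` and ``_HI`` pairs together carry one 64-bit result split across two
``word32`` fields, which is why a negative **S** + **A** - **P** yields
``0xFFFFFFFF`` in the high word.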
1762
1763 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1764
1765 Loaded Code Object Path Uniform Resource Identifier (URI)
1766 ---------------------------------------------------------
1767
1768 The AMD GPU code object loader represents the path of the ELF shared object from
1769 which the code object was loaded as a textual Uniform Resource Identifier (URI).
1770 Note that the code object is the in memory loaded relocated form of the ELF
1771 shared object.  Multiple code objects may be loaded at different memory
1772 addresses in the same process from the same ELF shared object.
1773
1774 The loaded code object path URI syntax is defined by the following BNF syntax:
1775
1776 .. code::
1777
1778   code_object_uri ::== file_uri | memory_uri
1779   file_uri        ::== "file://" file_path [ range_specifier ]
1780   memory_uri      ::== "memory://" process_id range_specifier
1781   range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1782   file_path       ::== URI_ENCODED_OS_FILE_PATH
1783   process_id      ::== DECIMAL_NUMBER
1784   number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1785
1786 **number**
1787   Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1788   and octal values by "0".
1789
1790 **file_path**
1791   Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1792   every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
1793   encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
1794   the path are separated by "/".
1795
1796 **offset**
1797   Is a 0-based byte offset to the start of the code object.  For a file URI, it
1798   is from the start of the file specified by the ``file_path``, and if omitted
1799   defaults to 0. For a memory URI, it is the memory address and is required.
1800
1801 **size**
1802   Is the number of bytes in the code object.  For a file URI, if omitted it
1803   defaults to the size of the file.  It is required for a memory URI.
1804
1805 **process_id**
1806   Is the identity of the process owning the memory.  For Linux it is the C
1807   unsigned integral decimal literal for the process ID (PID).
1808
1809 For example:
1810
1811 .. code::
1812
1813   file:///dir1/dir2/file1
1814   file:///dir3/dir4/file2#offset=0x2000&size=3000
1815   memory://1234#offset=0x20000&size=3000
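
A file URI of this form could be produced along the following lines. This is a
sketch only: ``uri_unreserved`` and ``make_file_uri`` are hypothetical names,
the choice of "#" as the range separator and of a hexadecimal offset with a
decimal size are arbitrary picks among the forms the grammar allows, and the
path is assumed to already be a UTF-8 string with "/" separators.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Nonzero if c may appear unencoded, i.e. matches [a-zA-Z0-9/_.~-]. */
static int uri_unreserved(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '/' || c == '_' || c == '.' ||
         c == '~' || c == '-';
}

/* Build "file://<encoded path>" and, if with_range is nonzero, append
   "#offset=<hex>&size=<decimal>". Returns the URI length, or -1 if out
   is too small. */
static int make_file_uri(const char *path, unsigned long long offset,
                         unsigned long long size, int with_range,
                         char *out, size_t out_size) {
  static const char hex[] = "0123456789ABCDEF";
  const char *prefix = "file://";
  size_t n = strlen(prefix);
  const unsigned char *p;
  if (out_size < n + 1)
    return -1;
  memcpy(out, prefix, n);
  for (p = (const unsigned char *)path; *p; ++p) {
    if (uri_unreserved(*p)) {
      if (n + 1 >= out_size)
        return -1;
      out[n++] = (char)*p;
    } else {
      /* Percent-encode as "%" plus two uppercase hexadecimal digits. */
      if (n + 3 >= out_size)
        return -1;
      out[n++] = '%';
      out[n++] = hex[*p >> 4];
      out[n++] = hex[*p & 15];
    }
  }
  out[n] = '\0';
  if (with_range) {
    int r = snprintf(out + n, out_size - n, "#offset=0x%llx&size=%llu",
                     offset, size);
    if (r < 0 || (size_t)r >= out_size - n)
      return -1;
    n += (size_t)r;
  }
  return (int)n;
}
```

Called with the path ``/dir3/dir4/file2``, offset ``0x2000`` and size ``3000``,
the sketch reproduces the second example URI above.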
1816
1817 .. _amdgpu-dwarf-debug-information:
1818
1819 DWARF Debug Information
1820 =======================
1821
1822 .. warning::
1823
1824    This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1825    is not currently fully implemented and is subject to change.
1826
1827 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1828 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
1829 object executable code and data to the source language constructs. It can be
1830 used by tools such as debuggers and profilers. It uses features defined in
1831 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1832 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1833
1834 This section defines the AMDGPU target architecture specific DWARF mappings.
1835
1836 .. _amdgpu-dwarf-register-identifier:
1837
1838 Register Identifier
1839 -------------------
1840
1841 This section defines the AMDGPU target architecture register numbers used in
1842 DWARF operation expressions (see DWARF Version 5 section 2.5 and
1843 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1844 instructions (see DWARF Version 5 section 6.4 and
1845 :ref:`amdgpu-dwarf-call-frame-information`).
1846
1847 A single code object can contain code for kernels that have different wavefront
1848 sizes. The vector registers and some scalar registers are based on the wavefront
1849 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1850 simplifies the consumer of the DWARF so that each register has a fixed size,
1851 rather than being dynamic according to the wavefront size mode. Similarly,
1852 distinct DWARF registers are defined for those registers that vary in size
1853 according to the process address size. This allows a consumer to treat a
1854 specific AMDGPU processor as a single architecture regardless of how it is
1855 configured at run time. The compiler explicitly specifies the DWARF registers
1856 that match the mode in which the code it is generating will be executed.
1857
1858 DWARF registers are encoded as numbers, which are mapped to architecture
1859 registers. The mapping for AMDGPU is defined in
1860 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1861 mapping.
1862
1863 .. table:: AMDGPU DWARF Register Mapping
1864    :name: amdgpu-dwarf-register-mapping-table
1865
1866    ============== ================= ======== ==================================
1867    DWARF Register AMDGPU Register   Bit Size Description
1868    ============== ================= ======== ==================================
1869    0              PC_32             32       Program Counter (PC) when
1870                                              executing in a 32-bit process
1871                                              address space. Used in the CFI to
1872                                              describe the PC of the calling
1873                                              frame.
1874    1              EXEC_MASK_32      32       Execution Mask Register when
1875                                              executing in wavefront 32 mode.
1876    2-15           *Reserved*                 *Reserved for highly accessed
1877                                              registers using DWARF shortcut.*
1878    16             PC_64             64       Program Counter (PC) when
1879                                              executing in a 64-bit process
1880                                              address space. Used in the CFI to
1881                                              describe the PC of the calling
1882                                              frame.
1883    17             EXEC_MASK_64      64       Execution Mask Register when
1884                                              executing in wavefront 64 mode.
1885    18-31          *Reserved*                 *Reserved for highly accessed
1886                                              registers using DWARF shortcut.*
1887    32-95          SGPR0-SGPR63      32       Scalar General Purpose
1888                                              Registers.
1889    96-127         *Reserved*                 *Reserved for frequently accessed
1890                                              registers using DWARF 1-byte ULEB.*
1891    128            STATUS            32       Status Register.
1892    129-511        *Reserved*                 *Reserved for future Scalar
1893                                              Architectural Registers.*
1894    512            VCC_32            32       Vector Condition Code Register
1895                                              when executing in wavefront 32
1896                                              mode.
1897    513-767        *Reserved*                 *Reserved for future Vector
1898                                              Architectural Registers when
1899                                              executing in wavefront 32 mode.*
1900    768            VCC_64            64       Vector Condition Code Register
1901                                              when executing in wavefront 64
1902                                              mode.
1903    769-1023       *Reserved*                 *Reserved for future Vector
1904                                              Architectural Registers when
1905                                              executing in wavefront 64 mode.*
1906    1024-1087      *Reserved*                 *Reserved for padding.*
1907    1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1908    1130-1535      *Reserved*                 *Reserved for future Scalar
1909                                              General Purpose Registers.*
1910    1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1911                                              when executing in wavefront 32
1912                                              mode.
1913    1792-2047      *Reserved*                 *Reserved for future Vector
1914                                              General Purpose Registers when
1915                                              executing in wavefront 32 mode.*
1916    2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1917                                              when executing in wavefront 32
1918                                              mode.
1919    2304-2559      *Reserved*                 *Reserved for future Vector
1920                                              Accumulation Registers when
1921                                              executing in wavefront 32 mode.*
1922    2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1923                                              when executing in wavefront 64
1924                                              mode.
1925    2816-3071      *Reserved*                 *Reserved for future Vector
1926                                              General Purpose Registers when
1927                                              executing in wavefront 64 mode.*
1928    3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1929                                              when executing in wavefront 64
1930                                              mode.
1931    3328-3583      *Reserved*                 *Reserved for future Vector
1932                                              Accumulation Registers when
1933                                              executing in wavefront 64 mode.*
1934    ============== ================= ======== ==================================
1935
1936 The vector registers are represented as the full size for the wavefront. They
1937 are organized as consecutive dwords (32 bits), one per lane, with the dword at
1938 the least significant bit position corresponding to lane 0 and so forth. DWARF
1939 location expressions involving the ``DW_OP_LLVM_offset`` and
1940 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1941 register corresponding to the lane that is executing the current thread of
1942 execution in languages that are implemented using a SIMD or SIMT execution
1943 model.
1944
1945 If the wavefront size is 32 lanes then the wavefront 32 mode register
1946 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1947 mode register definitions are used. Some AMDGPU targets support executing in
1948 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1949 to the wavefront mode of the generated code will be used.
1950
1951 If code is generated to execute in a 32-bit process address space, then the
1952 32-bit process address space register definitions are used. If code is generated
1953 to execute in a 64-bit process address space, then the 64-bit process address
1954 space register definitions are used. The ``amdgcn`` target only supports the
1955 64-bit process address space.
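
To illustrate the register mapping table, the following C sketch converts
general purpose register indices to DWARF register numbers. The function names
are hypothetical and are not taken from the LLVM sources; the constants are the
range starts listed in the table.

```c
/* DWARF register number for SGPR n, or -1 if the table defines none. */
static int dwarf_reg_for_sgpr(unsigned n) {
  if (n <= 63)
    return 32 + (int)n;          /* SGPR0-SGPR63 map to 32-95 */
  if (n <= 105)
    return 1088 + (int)(n - 64); /* SGPR64-SGPR105 map to 1088-1129 */
  return -1;
}

/* DWARF register number for VGPR n in the given wavefront size mode. */
static int dwarf_reg_for_vgpr(unsigned n, unsigned wavefront_size) {
  if (n > 255)
    return -1;
  if (wavefront_size == 32)
    return 1536 + (int)n;
  if (wavefront_size == 64)
    return 2560 + (int)n;
  return -1;
}

/* DWARF register number for AGPR n in the given wavefront size mode. */
static int dwarf_reg_for_agpr(unsigned n, unsigned wavefront_size) {
  if (n > 255)
    return -1;
  if (wavefront_size == 32)
    return 2048 + (int)n;
  if (wavefront_size == 64)
    return 3072 + (int)n;
  return -1;
}
```

Because each wavefront size has its own register range, a consumer can derive
the register size from the DWARF register number alone.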
1956
1957 .. _amdgpu-dwarf-memory-space-identifier:
1958
1959 Memory Space Identifier
1960 -----------------------
1961
1962 The DWARF memory space represents the source language memory space. See DWARF
1963 Version 5 section 2.12 which is updated by the *DWARF Extensions For
1964 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
1965
1966 The DWARF memory space mapping used for AMDGPU is defined in
1967 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
1968
1969 .. table:: AMDGPU DWARF Memory Space Mapping
1970    :name: amdgpu-dwarf-memory-space-mapping-table
1971
1972    =========================== ====== =================
1973    DWARF                              AMDGPU
1974    ---------------------------------- -----------------
1975    Memory Space Name           Value  Memory Space
1976    =========================== ====== =================
1977    ``DW_MSPACE_LLVM_none``     0x0000 Generic (Flat)
1978    ``DW_MSPACE_LLVM_global``   0x0001 Global
1979    ``DW_MSPACE_LLVM_constant`` 0x0002 Global
1980    ``DW_MSPACE_LLVM_group``    0x0003 Local (group/LDS)
1981    ``DW_MSPACE_LLVM_private``  0x0004 Private (Scratch)
1982    ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
1983    =========================== ====== =================
1984
1985 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
1986 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
1987
1988 In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. It
1989 provides access to the hardware GDS memory, an AMD extension, which is
1990 scratchpad memory allocated per device.
1991
1992 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
1993 default memory space of ``DW_MSPACE_LLVM_none`` is used.
1994
1995 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1996 mapping of DWARF memory spaces to DWARF address spaces, including address size
1997 and NULL value.
1998
1999 .. _amdgpu-dwarf-address-space-identifier:
2000
2001 Address Space Identifier
2002 ------------------------
2003
2004 DWARF address spaces correspond to target architecture specific linear
2005 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
2006 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
2007
2008 The DWARF address space mapping used for AMDGPU is defined in
2009 :ref:`amdgpu-dwarf-address-space-mapping-table`.
2010
2011 .. table:: AMDGPU DWARF Address Space Mapping
2012    :name: amdgpu-dwarf-address-space-mapping-table
2013
2014    ======================================= ===== ======= ======== ===================== =======================
2015    DWARF                                                          AMDGPU                Notes
2016    --------------------------------------- ----- ---------------- --------------------- -----------------------
2017    Address Space Name                      Value Address Bit Size LLVM IR Address Space
2018    --------------------------------------- ----- ------- -------- --------------------- -----------------------
2019    ..                                            64-bit  32-bit
2020                                                  process process
2021                                                  address address
2022                                                  space   space
2023    ======================================= ===== ======= ======== ===================== =======================
2024    ``DW_ASPACE_LLVM_none``                 0x00  64      32       Global                *default address space*
2025    ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
2026    ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
2027    ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
2028    *Reserved*                              0x04
2029    ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch)     *focused lane*
2030    ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch)     *unswizzled wavefront*
2031    ======================================= ===== ======= ======== ===================== =======================
2032
2033 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
2034 spaces including address size and NULL value.
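
The address space values and address sizes from the table can be written down
as a small C lookup. This is a sketch under the assumption that a consumer
wants the address bit size for a given process address size; the enumerator
names follow the table, while ``amdgpu_aspace_address_bits`` is a hypothetical
function name.

```c
/* DWARF address space values copied from the table above. */
enum amdgpu_dwarf_aspace {
  DW_ASPACE_LLVM_none = 0x00,
  DW_ASPACE_AMDGPU_generic = 0x01,
  DW_ASPACE_AMDGPU_region = 0x02,
  DW_ASPACE_AMDGPU_local = 0x03,
  /* 0x04 is reserved */
  DW_ASPACE_AMDGPU_private_lane = 0x05,
  DW_ASPACE_AMDGPU_private_wave = 0x06
};

/* Address bit size of aspace when executing in a process address space
   of process_bits (32 or 64). Returns -1 for reserved or unknown input. */
static int amdgpu_aspace_address_bits(unsigned aspace, int process_bits) {
  if (process_bits != 32 && process_bits != 64)
    return -1;
  switch (aspace) {
  case DW_ASPACE_LLVM_none:
  case DW_ASPACE_AMDGPU_generic:
    return process_bits; /* follows the process address size */
  case DW_ASPACE_AMDGPU_region:
  case DW_ASPACE_AMDGPU_local:
  case DW_ASPACE_AMDGPU_private_lane:
  case DW_ASPACE_AMDGPU_private_wave:
    return 32;           /* always 32-bit */
  default:
    return -1;
  }
}
```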
2035
2036 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
2037 address space used in DWARF operations that do not specify an address space. It
2038 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
2039 related operations can refer to addresses in the program code.
2040
2041 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
2042 specify the flat address space. If the address corresponds to an address in the
2043 local address space, then it corresponds to the wavefront that is executing the
2044 focused thread of execution. If the address corresponds to an address in the
2045 private address space, then it corresponds to the lane that is executing the
2046 focused thread of execution for languages that are implemented using a SIMD or
2047 SIMT execution model.
2048
2049 .. note::
2050
2051   CUDA-like languages such as HIP that do not have address spaces in the
2052   language type system, but do allow variables to be allocated in different
2053   address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2054   address space in the DWARF expression operations as the default address space
2055   is the global address space.
2056
2057 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2058 specify the local address space corresponding to the wavefront that is executing
2059 the focused thread of execution.
2060
2061 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2062 to specify the private address space corresponding to the lane that is executing
2063 the focused thread of execution for languages that are implemented using a SIMD
2064 or SIMT execution model.
2065
2066 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2067 to specify the unswizzled private address space corresponding to the wavefront
2068 that is executing the focused thread of execution. The wavefront view of private
2069 memory is the per wavefront unswizzled backing memory layout defined in
2070 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2071 location for the backing memory of the wavefront (namely the address is not
2072 offset by ``wavefront-scratch-base``). The following formula can be used to
2073 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2074 ``DW_ASPACE_AMDGPU_private_wave`` address:
2075
2076 ::
2077
2078   private-address-wavefront =
2079     ((private-address-lane / 4) * wavefront-size * 4) +
2080     (wavefront-lane-id * 4) + (private-address-lane % 4)
2081
2082 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
2083 of the dwords for each lane starting with lane 0 is required, then this
2084 simplifies to:
2085
2086 ::
2087
2088   private-address-wavefront =
2089     private-address-lane * wavefront-size
2090
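The conversion above can be sketched in a few lines of Python. This is only an
illustration of the arithmetic: the function name and its parameters are
hypothetical and are not part of any LLVM or debugger API.

.. code:: python

  def private_lane_to_wave(private_address_lane, wavefront_lane_id,
                           wavefront_size=64):
      """Map a private lane address to an unswizzled private wave address."""
      if not 0 <= wavefront_lane_id < wavefront_size:
          raise ValueError("lane id must be in [0, wavefront_size)")
      # Split the lane address into a dword index and a byte within the dword.
      dword_index, byte_in_dword = divmod(private_address_lane, 4)
      # Each dword of a lane is interleaved with the same dword of all lanes.
      return (dword_index * wavefront_size * 4
              + wavefront_lane_id * 4
              + byte_in_dword)

For a dword aligned address and lane 0, the result equals
``private_address_lane * wavefront_size``, which matches the simplified formula.
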
2091 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2092 complete spilled vector register back into a complete vector register in the
2093 CFI. The frame pointer can be a private lane address which is dword aligned,
2094 which can be shifted to multiply by the wavefront size, and then used to form a
2095 private wavefront address that gives a location for a contiguous set of dwords,
2096 one per lane, where the vector register dwords are spilled. The compiler knows
2097 the wavefront size since it generates the code. Note that the type of the
2098 address may have to be converted as the size of a
2099 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2100 ``DW_ASPACE_AMDGPU_private_wave`` address.
2101
2102 .. _amdgpu-dwarf-lane-identifier:
2103
2104 Lane identifier
2105 ---------------
2106
2107 DWARF lane identifiers specify a target architecture lane position for hardware
2108 that executes in a SIMD or SIMT manner, and on which a source language maps its
2109 threads of execution onto those lanes. The DWARF lane identifier is pushed by
2110 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2111 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2112 section :ref:`amdgpu-dwarf-operation-expressions`.
2113
2114 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2115 wavefront. It is numbered from 0 to the wavefront size minus 1.
2116
2117 Operation Expressions
2118 ---------------------
2119
2120 DWARF expressions are used to compute program values and the locations of
2121 program objects. See DWARF Version 5 section 2.5 and
2122 :ref:`amdgpu-dwarf-operation-expressions`.
2123
2124 DWARF location descriptions describe how to access storage which includes memory
2125 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2126 significant bytes first, and bits are ordered within bytes with least
2127 significant bits first.
2128
2129 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2130 unwinding vector registers that are spilled under the execution mask to memory:
2131 the zero-single location description is the vector register, and the one-single
2132 location description is the spilled memory location description. The
2133 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2134 memory location description.
2135
2136 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2137 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2138 controlled by the execution mask. An undefined location description together
2139 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2140 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2141
2142 Debugger Information Entry Attributes
2143 -------------------------------------
2144
2145 This section describes how certain debugger information entry attributes are
2146 used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1
2147 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2148 :ref:`amdgpu-dwarf-low-level-information` and
2149 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2150
2151 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2152
2153 ``DW_AT_LLVM_lane_pc``
2154 ~~~~~~~~~~~~~~~~~~~~~~
2155
2156 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2157 location of the separate lanes of a SIMT thread.
2158
2159 If the lane is an active lane then this will be the same as the current program
2160 location.
2161
2162 If the lane is inactive, but was active on entry to the subprogram, then this is
2163 the program location in the subprogram at which execution of the lane is
2164 conceptually positioned.
2165
2166 If the lane was not active on entry to the subprogram, then this will be the
2167 undefined location. A client debugger can check if the lane is part of a valid
2168 work-group by checking that the lane is in the range of the associated
2169 work-group within the grid, accounting for partial work-groups. If it is not,
2170 then the debugger can omit any information for the lane. Otherwise, the debugger
2171 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2172 calling subprogram until it finds a non-undefined location. Conceptually the
2173 lane only has the call frames that it has a non-undefined
2174 ``DW_AT_LLVM_lane_pc``.
2175
2176 The following example illustrates how the AMDGPU backend can generate a DWARF
2177 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2178 following subprogram pseudo code for a target with 64 lanes per wavefront.
2179
2180 .. code::
2181   :number-lines:
2182
2183   SUBPROGRAM X
2184   BEGIN
2185     a;
2186     IF (c1) THEN
2187       b;
2188       IF (c2) THEN
2189         c;
2190       ELSE
2191         d;
2192       ENDIF
2193       e;
2194     ELSE
2195       f;
2196     ENDIF
2197     g;
2198   END
2199
2200 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2201 execution mask (``EXEC``) to linearize the control flow. The condition is
2202 evaluated to make a mask of the lanes for which the condition evaluates to true.
2203 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2204 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
2205 ``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
2206 of its negation with the mask saved at region entry. After the ``IF/THEN/ELSE``
2207 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2208 region. This is shown below. Other approaches are possible, but the basic
2209 concept is the same.
2210
2211 .. code::
2212   :number-lines:
2213
2214   $lex_start:
2215     a;
2216     %1 = EXEC
2217     %2 = c1
2218   $lex_1_start:
2219     EXEC = %1 & %2
2220   $lex_1_then:
2221       b;
2222       %3 = EXEC
2223       %4 = c2
2224   $lex_1_1_start:
2225       EXEC = %3 & %4
2226   $lex_1_1_then:
2227         c;
2228       EXEC = ~EXEC & %3
2229   $lex_1_1_else:
2230         d;
2231       EXEC = %3
2232   $lex_1_1_end:
2233       e;
2234     EXEC = ~EXEC & %1
2235   $lex_1_else:
2236       f;
2237     EXEC = %1
2238   $lex_1_end:
2239     g;
2240   $lex_end:
2241
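The mask manipulation above can be modelled with plain integers. The following
is a minimal sketch, not backend code: the function name, the ``entry_exec``
parameter, and the dictionary of results are all hypothetical, and the
condition masks ``c1`` and ``c2`` are supplied by the caller.

.. code:: python

  def run_linearized(entry_exec, c1, c2):
      """Return, for each statement a..g, the mask of lanes that execute it."""
      executed = {}
      exec_mask = entry_exec
      executed["a"] = exec_mask
      saved_1 = exec_mask                # %1 = EXEC
      exec_mask = saved_1 & c1           # EXEC = %1 & %2
      executed["b"] = exec_mask
      saved_3 = exec_mask                # %3 = EXEC
      exec_mask = saved_3 & c2           # EXEC = %3 & %4
      executed["c"] = exec_mask
      exec_mask = ~exec_mask & saved_3   # EXEC = ~EXEC & %3
      executed["d"] = exec_mask
      exec_mask = saved_3                # EXEC = %3
      executed["e"] = exec_mask
      exec_mask = ~exec_mask & saved_1   # EXEC = ~EXEC & %1
      executed["f"] = exec_mask
      exec_mask = saved_1                # EXEC = %1
      executed["g"] = exec_mask
      return executed

With a 4-lane toy wavefront, ``run_linearized(0b1111, 0b0011, 0b0101)`` runs
``c`` on lane 0, ``d`` on lane 1, and ``f`` on lanes 2 and 3, while ``a`` and
``g`` run on every lane.
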
2242 To create the DWARF location list expression that defines the location
2243 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2244 pseudo instruction can be used to annotate the linearized control flow. This can
2245 be done by defining an artificial variable for the lane PC. The DWARF location
2246 list expression created for it is used as the value of the
2247 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2248
2249 A DWARF procedure is defined for each well nested structured control flow region
2250 which provides the conceptual lane program location for a lane if it is not
2251 active (namely it is divergent). The DWARF operation expression for each region
2252 conceptually inherits the value of the immediately enclosing region and modifies
2253 it according to the semantics of the region.
2254
2255 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2256 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2257 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2258 region since the ``THEN`` region has completed.
2259
2260 The lane PC artificial variable is assigned at each region transition. It uses
2261 the immediately enclosing region's DWARF procedure to compute the program
2262 location for each lane assuming they are divergent, and then modifies the result
2263 by inserting the current program location for each lane that the ``EXEC`` mask
2264 indicates is active.
2265
2266 By having separate DWARF procedures for each region, they can be reused to
2267 define the value for any nested region. This reduces the total size of the DWARF
2268 operation expressions.
2269
2270 The following provides an example using pseudo LLVM MIR.
2271
2272 .. code::
2273   :number-lines:
2274
2275   $lex_start:
2276     DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2277       DW_AT_name = "__uint64";
2278       DW_AT_byte_size = 8;
2279       DW_AT_encoding = DW_ATE_unsigned;
2280     ];
2281     DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2282       DW_AT_name = "__active_lane_pc";
2283       DW_AT_location = [
2284         DW_OP_regx PC;
2285         DW_OP_LLVM_extend 64, 64;
2286         DW_OP_regval_type EXEC, %__uint_64;
2287         DW_OP_LLVM_select_bit_piece 64, 64;
2288       ];
2289     ];
2290     DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2291       DW_AT_name = "__divergent_lane_pc";
2292       DW_AT_location = [
2293         DW_OP_LLVM_undefined;
2294         DW_OP_LLVM_extend 64, 64;
2295       ];
2296     ];
2297     DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2298       DW_OP_call_ref %__divergent_lane_pc;
2299       DW_OP_call_ref %__active_lane_pc;
2300     ];
2301     a;
2302     %1 = EXEC;
2303     DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2304     %2 = c1;
2305   $lex_1_start:
2306     EXEC = %1 & %2;
2307   $lex_1_then:
2308       DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2309         DW_AT_name = "__divergent_lane_pc_1_then";
2310         DW_AT_location = DIExpression[
2311           DW_OP_call_ref %__divergent_lane_pc;
2312           DW_OP_addrx &lex_1_start;
2313           DW_OP_stack_value;
2314           DW_OP_LLVM_extend 64, 64;
2315           DW_OP_call_ref %__lex_1_save_exec;
2316           DW_OP_deref_type 64, %__uint_64;
2317           DW_OP_LLVM_select_bit_piece 64, 64;
2318         ];
2319       ];
2320       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2321         DW_OP_call_ref %__divergent_lane_pc_1_then;
2322         DW_OP_call_ref %__active_lane_pc;
2323       ];
2324       b;
2325       %3 = EXEC;
2326       DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2327       %4 = c2;
2328   $lex_1_1_start:
2329       EXEC = %3 & %4;
2330   $lex_1_1_then:
2331         DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2332           DW_AT_name = "__divergent_lane_pc_1_1_then";
2333           DW_AT_location = DIExpression[
2334             DW_OP_call_ref %__divergent_lane_pc_1_then;
2335             DW_OP_addrx &lex_1_1_start;
2336             DW_OP_stack_value;
2337             DW_OP_LLVM_extend 64, 64;
2338             DW_OP_call_ref %__lex_1_1_save_exec;
2339             DW_OP_deref_type 64, %__uint_64;
2340             DW_OP_LLVM_select_bit_piece 64, 64;
2341           ];
2342         ];
2343         DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2344           DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2345           DW_OP_call_ref %__active_lane_pc;
2346         ];
2347         c;
2348       EXEC = ~EXEC & %3;
2349   $lex_1_1_else:
2350         DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2351           DW_AT_name = "__divergent_lane_pc_1_1_else";
2352           DW_AT_location = DIExpression[
2353             DW_OP_call_ref %__divergent_lane_pc_1_then;
2354             DW_OP_addrx &lex_1_1_end;
2355             DW_OP_stack_value;
2356             DW_OP_LLVM_extend 64, 64;
2357             DW_OP_call_ref %__lex_1_1_save_exec;
2358             DW_OP_deref_type 64, %__uint_64;
2359             DW_OP_LLVM_select_bit_piece 64, 64;
2360           ];
2361         ];
2362         DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2363           DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2364           DW_OP_call_ref %__active_lane_pc;
2365         ];
2366         d;
2367       EXEC = %3;
2368   $lex_1_1_end:
2369       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2370         DW_OP_call_ref %__divergent_lane_pc_1_then;
2371         DW_OP_call_ref %__active_lane_pc;
2372       ];
2373       e;
2374     EXEC = ~EXEC & %1;
2375   $lex_1_else:
2376       DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2377         DW_AT_name = "__divergent_lane_pc_1_else";
2378         DW_AT_location = DIExpression[
2379           DW_OP_call_ref %__divergent_lane_pc;
2380           DW_OP_addrx &lex_1_end;
2381           DW_OP_stack_value;
2382           DW_OP_LLVM_extend 64, 64;
2383           DW_OP_call_ref %__lex_1_save_exec;
2384           DW_OP_deref_type 64, %__uint_64;
2385           DW_OP_LLVM_select_bit_piece 64, 64;
2386         ];
2387       ];
2388       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2389         DW_OP_call_ref %__divergent_lane_pc_1_else;
2390         DW_OP_call_ref %__active_lane_pc;
2391       ];
2392       f;
2393     EXEC = %1;
2394   $lex_1_end:
2395     DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2396       DW_OP_call_ref %__divergent_lane_pc;
2397       DW_OP_call_ref %__active_lane_pc;
2398     ];
2399     g;
2400   $lex_end:
2401
2402 The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
2403 that are active with the current program location.
2404
2405 Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
2406 created for the execution masks saved on entry to a region. Using the
2407 ``DBG_VALUE`` pseudo instruction, location list entries will be created that
2408 describe where the artificial variables are allocated at any given program
2409 location. The compiler may allocate them to registers or spill them to memory.
2410
2411 The DWARF procedures for each region use the values of the saved execution mask
2412 artificial variables to only update the lanes that are active on entry to the
2413 region. All other lanes retain the value of the enclosing region where they were
2414 last active. If they were not active on entry to the subprogram, then they will
2415 have the undefined location description.
2416
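The way nested regions refine the lane PC vector can be sketched with a
simplified model in which ``DW_OP_LLVM_select_bit_piece`` is treated as a
per-lane select between two vectors. The helper names, the use of ``None`` for
the undefined location, and the integer program locations are all assumptions
made for illustration; this is not a DWARF expression evaluator.

.. code:: python

  UNDEFINED = None  # stands in for the undefined location description

  def select_lanes(base, value, mask):
      """Use value for lanes whose mask bit is set, otherwise keep base."""
      return [value if (mask >> lane) & 1 else old
              for lane, old in enumerate(base)]

  def lane_pcs_in_inner_then(entry_exec, c1, c2, current_pc,
                             lex_1_start, lex_1_1_start, wavefront_size=64):
      """Lane PC vector while executing statement c of the example."""
      divergent = [UNDEFINED] * wavefront_size      # %__divergent_lane_pc
      save_1 = entry_exec                           # %__lex_1_save_exec
      div_1_then = select_lanes(divergent, lex_1_start, save_1)
      save_1_1 = save_1 & c1                        # %__lex_1_1_save_exec
      div_1_1_then = select_lanes(div_1_then, lex_1_1_start, save_1_1)
      exec_mask = save_1_1 & c2                     # EXEC inside the inner THEN
      return select_lanes(div_1_1_then, current_pc, exec_mask)

In this model, a lane that was never active keeps ``UNDEFINED``, a lane that
failed ``c1`` reports ``lex_1_start``, a lane that failed only ``c2`` reports
``lex_1_1_start``, and an active lane reports the current program location.
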
2417 Other structured control flow regions can be handled similarly. For example,
2418 loops would set the divergent program location for the region at the end of the
2419 loop. Any lanes active will be in the loop, and any lanes not active must have
2420 exited the loop.
2421
2422 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2423 ``IF/THEN/ELSE`` regions.
2424
2425 The DWARF procedures can use the active lane artificial variable described in
2426 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2427 ``EXEC`` mask in order to support whole or quad wavefront mode.
2428
2429 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2430
2431 ``DW_AT_LLVM_active_lane``
2432 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2433
2434 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2435 entry is used to specify the lanes that are conceptually active for a SIMT
2436 thread.
2437
2438 The execution mask may be modified to implement whole or quad wavefront mode
2439 operations. For example, all lanes may need to temporarily be made active to
2440 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2441 update it to enable the necessary lanes, perform the operations, and then
2442 restore the ``EXEC`` mask from the saved value. While executing the whole
2443 wavefront region, the conceptual execution mask is the saved value, not the
2444 ``EXEC`` value.
2445
2446 This is handled by defining an artificial variable for the active lane mask. The
2447 active lane mask artificial variable would be the actual ``EXEC`` mask for
2448 normal regions, and the saved execution mask for regions where the mask is
2449 temporarily updated. The location list expression created for this artificial
2450 variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2451 attribute.
2452
2453 ``DW_AT_LLVM_augmentation``
2454 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2455
2456 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2457 debugger information entry has the following value for the augmentation string:
2458
2459 ::
2460
2461   [amdgpu:v0.0]
2462
2463 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2464 extensions used in the DWARF of the compilation unit. The version number
2465 conforms to [SEMVER]_.
2466
2467 Call Frame Information
2468 ----------------------
2469
2470 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2471 *unwind* call frames in a running process or core dump. See DWARF Version 5
2472 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2473
2474 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2475
2476 1.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2477
2478     ::
2479
2480       [amd:v0.0]
2481
2482     The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
2483     extensions used in this CIE or by the FDEs that use it. The version number
2484     conforms to [SEMVER]_.
2485
2486 2.  ``address_size`` for the ``Global`` address space is defined in
2487     :ref:`amdgpu-dwarf-address-space-identifier`.
2488
2489 3.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2490
2491 4.  ``code_alignment_factor`` is 4 bytes.
2492
2493     .. TODO::
2494
2495        Add to :ref:`amdgpu-processor-table` table.
2496
2497 5.  ``data_alignment_factor`` is 4 bytes.
2498
2499     .. TODO::
2500
2501        Add to :ref:`amdgpu-processor-table` table.
2502
2503 6.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2504     for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2505
2506 7.  ``initial_instructions`` Since a subprogram X with fewer registers can be
2507     called from subprogram Y that has more allocated, X will not change any of
2508     the extra registers as it cannot access them. Therefore, the default rule
2509     for all columns is ``same value``.
2510
2511 For AMDGPU the register number follows the numbering defined in
2512 :ref:`amdgpu-dwarf-register-identifier`.
2513
2514 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2515 the return address to get the address of a byte within the call site
2516 instructions. See DWARF Version 5 section 6.4.4.
2517
2518 Accelerated Access
2519 ------------------
2520
2521 See DWARF Version 5 section 6.1.
2522
2523 Lookup By Name Section Header
2524 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2525
2526 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2527
2528 For AMDGPU, the lookup by name section header fields have the following values:
2529
2530 ``augmentation_string_size`` (uword)
2531
2532   Set to the length of the ``augmentation_string`` value which is always a
2533   multiple of 4.
2534
2535 ``augmentation_string`` (sequence of UTF-8 characters)
2536
2537   Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2538
2539   ::
2540
2541     [amdgpu:v0.0]
2542
2543   The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2544   extensions used in the DWARF of this index. The version number conforms to
2545   [SEMVER]_.
2546
2547   .. note::
2548
2549     This differs from the DWARF Version 5 definition, which requires the first
2550     4 characters to be the vendor ID. But this is consistent with the other
2551     augmentation strings and does allow multiple vendor contributions. However,
2552     backwards compatibility may be more desirable.
2553
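The relationship between ``augmentation_string`` and
``augmentation_string_size`` can be sketched as follows. The function name is
hypothetical; only the string value and the padding rule come from the text
above.

.. code:: python

  def pad_augmentation_string(text="[amdgpu:v0.0]"):
      """Return (augmentation_string_size, augmentation_string bytes)."""
      raw = text.encode("utf-8")
      # Append null bytes until the length is a multiple of 4.
      padded = raw + b"\0" * (-len(raw) % 4)
      return len(padded), padded

The default string is 13 bytes long, so it is padded with three null bytes and
the reported size is 16.
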
2554 Lookup By Address Section Header
2555 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2556
2557 See DWARF Version 5 section 6.1.2.
2558
2559 For AMDGPU, the lookup by address section header fields have the following values:
2560
2561 ``address_size`` (ubyte)
2562
2563   Match the address size for the ``Global`` address space defined in
2564   :ref:`amdgpu-dwarf-address-space-identifier`.
2565
2566 ``segment_selector_size`` (ubyte)
2567
2568   AMDGPU does not use a segment selector so this is 0. The entries in the
2569   ``.debug_aranges`` do not have a segment selector.
2570
2571 Line Number Information
2572 -----------------------
2573
2574 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2575
2576 AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2577 The instruction set must be obtained from the ELF file header ``e_flags`` field
2578 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2579 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
2580
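A consumer could extract the ``EF_AMDGPU_MACH`` field from a 64-bit ELF header
roughly as sketched below. The function name is hypothetical, and the mask
value ``0x0ff`` is an assumption that should be checked against
:ref:`ELF Header <amdgpu-elf-header>`. The offset of ``e_flags`` follows the
standard ELF64 header layout.

.. code:: python

  import struct

  EF_AMDGPU_MACH = 0x0FF   # assumed mask value; see the ELF Header section
  E_FLAGS_OFFSET_ELF64 = 48  # e_flags follows e_shoff in the ELF64 header

  def amdgpu_mach_from_elf64_header(header):
      """Return the EF_AMDGPU_MACH bits of e_flags from ELF64 header bytes."""
      if header[:4] != b"\x7fELF":
          raise ValueError("not an ELF file")
      if header[4] != 2:
          raise ValueError("not an ELFCLASS64 file")
      # EI_DATA: 1 is little endian, 2 is big endian.
      endian = "<" if header[5] == 1 else ">"
      (e_flags,) = struct.unpack_from(endian + "I", header, E_FLAGS_OFFSET_ELF64)
      return e_flags & EF_AMDGPU_MACH

The returned value identifies the processor and must be interpreted using the
processor table referenced from the ELF header description.
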
2581 .. TODO::
2582
2583   Should the ``isa`` state machine register be used to indicate if the code is
2584   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2585
2586 For AMDGPU the line number program header fields have the following values (see
2587 DWARF Version 5 section 6.2.4):
2588
2589 ``address_size`` (ubyte)
2590   Matches the address size for the ``Global`` address space defined in
2591   :ref:`amdgpu-dwarf-address-space-identifier`.
2592
2593 ``segment_selector_size`` (ubyte)
2594   AMDGPU does not use a segment selector so this is 0.
2595
2596 ``minimum_instruction_length`` (ubyte)
2597   For GFX9-GFX11 this is 4.
2598
2599 ``maximum_operations_per_instruction`` (ubyte)
2600   For GFX9-GFX11 this is 1.
2601
2602 Source text for online-compiled programs (for example, those compiled by the
2603 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2604 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2605 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2606 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2607
2608 The Clang option used to control source embedding in AMDGPU is defined in
2609 :ref:`amdgpu-clang-debug-options-table`.
2610
2611   .. table:: AMDGPU Clang Debug Options
2612      :name: amdgpu-clang-debug-options-table
2613
2614      ==================== ==================================================
2615      Debug Flag           Description
2616      ==================== ==================================================
2617      -g[no-]embed-source  Enable/disable embedding source text in DWARF
2618                           debug sections. Useful for environments where
2619                           source cannot be written to disk, such as
2620                           when performing online compilation.
2621      ==================== ==================================================
2622
2623 For example:
2624
2625 ``-gembed-source``
2626   Enable the embedded source.
2627
2628 ``-gno-embed-source``
2629   Disable the embedded source.
2630
2631 32-Bit and 64-Bit DWARF Formats
2632 -------------------------------
2633
2634 See DWARF Version 5 section 7.4 and
2635 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2636
2637 For AMDGPU:
2638
2639 * For the ``amdgcn`` target architecture only the 64-bit process address space
2640   is supported.
2641
2642 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2643   the 32-bit DWARF format.
2644
2645 Unit Headers
2646 ------------
2647
2648 For AMDGPU the following values apply for each of the unit headers described in
2649 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2650
2651 ``address_size`` (ubyte)
2652   Matches the address size for the ``Global`` address space defined in
2653   :ref:`amdgpu-dwarf-address-space-identifier`.
2654
2655 .. _amdgpu-code-conventions:
2656
2657 Code Conventions
2658 ================
2659
2660 This section provides code conventions used for each supported target triple OS
2661 (see :ref:`amdgpu-target-triples`).
2662
2663 AMDHSA
2664 ------
2665
2666 This section provides code conventions used when the target triple OS is
2667 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2668
2669 .. _amdgpu-amdhsa-code-object-metadata:
2670
2671 Code Object Metadata
2672 ~~~~~~~~~~~~~~~~~~~~
2673
2674 The code object metadata specifies extensible metadata associated with the code
2675 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
2676 encoding and semantics of this metadata depend on the code object version; see
2677 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2678 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2679 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2680 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2681
2682 Code object metadata is specified in a note record (see
2683 :ref:`amdgpu-note-records`) and is required when the target triple OS is
2684 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2685 information necessary to support the HSA compatible runtime kernel queries, for
2686 example the segment sizes needed in a dispatch packet. In addition, a
2687 high-level language runtime may require other information to be included. For
2688 example, the AMD OpenCL runtime records kernel argument information.
2689
2690 .. _amdgpu-amdhsa-code-object-metadata-v2:
2691
2692 Code Object V2 Metadata
2693 +++++++++++++++++++++++
2694
2695 .. warning::
2696   Code object V2 is not the default code object version emitted by this version
2697   of LLVM.
2698
2699 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2700 (see :ref:`amdgpu-note-records-v2`).
2701
2702 The metadata is specified as a YAML formatted string (see [YAML]_ and
2703 :doc:`YamlIO`).
2704
2705 .. TODO::
2706
2707   Is the string null terminated? It probably should not if YAML allows it to
2708   contain null characters, otherwise it should be.
2709
2710 The metadata is represented as a single YAML document consisting of the mapping
2711 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2712 referenced tables.
2713
2714 For boolean values, the string values of ``false`` and ``true`` are used for
2715 false and true respectively.
2716
2717 Additional information can be added to the mappings. To avoid conflicts, any
2718 non-AMD key names should be prefixed by "*vendor-name*.".
2719
2720   .. table:: AMDHSA Code Object V2 Metadata Map
2721      :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2722
2723      ========== ============== ========= =======================================
2724      String Key Value Type     Required? Description
2725      ========== ============== ========= =======================================
2726      "Version"  sequence of    Required  - The first integer is the major
2727                 2 integers                 version. Currently 1.
2728                                          - The second integer is the minor
2729                                            version. Currently 0.
2730      "Printf"   sequence of              Each string is encoded information
2731                 strings                  about a printf function call. The
2732                                          encoded information is organized as
2733                                          fields separated by colon (':'):
2734
2735                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2736
2737                                          where:
2738
2739                                          ``ID``
2740                                            A 32-bit integer as a unique id for
2741                                            each printf function call
2742
2743                                          ``N``
2744                                            A 32-bit integer equal to the number
2745                                            of arguments of printf function call
2746                                            minus 1
2747
2748                                          ``S[i]`` (where i = 0, 1, ... , N-1)
2749                                            32-bit integers for the size in bytes
2750                                            of the i-th FormatString argument of
2751                                            the printf function call
2752
2753                                          FormatString
2754                                            The format string passed to the
2755                                            printf function call.
2756      "Kernels"  sequence of    Required  Sequence of the mappings for each
2757                 mapping                  kernel in the code object. See
2758                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2759                                          for the definition of the mapping.
2760      ========== ============== ========= =======================================
2761
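The ``"Printf"`` string encoding described in the table can be decoded with a
short routine such as the following sketch. The function name and the returned
tuple are hypothetical. The sketch assumes that only the leading fields are
colon separated, so any colon after the last size field is treated as part of
the format string.

.. code:: python

  def parse_printf_entry(entry):
      """Split 'ID:N:S[0]:...:S[N-1]:FormatString' into its fields."""
      printf_id, count, rest = entry.split(":", 2)
      n = int(count)
      # Split off exactly N size fields; the remainder is the format string.
      *sizes, format_string = rest.split(":", n)
      if len(sizes) != n:
          raise ValueError("fewer size fields than declared by N")
      return int(printf_id), [int(size) for size in sizes], format_string

For example, ``parse_printf_entry("1:2:4:8:x=%d y=%ld: done")`` yields the id
``1``, the argument sizes ``[4, 8]``, and the format string
``"x=%d y=%ld: done"``.
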
2762 ..
2763
2764   .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2765      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2766
2767      ================= ============== ========= ================================
2768      String Key        Value Type     Required? Description
2769      ================= ============== ========= ================================
2770      "Name"            string         Required  Source name of the kernel.
2771      "SymbolName"      string         Required  Name of the kernel
2772                                                 descriptor ELF symbol.
2773      "Language"        string                   Source language of the kernel.
2774                                                 Values include:
2775
2776                                                 - "OpenCL C"
2777                                                 - "OpenCL C++"
2778                                                 - "HCC"
2779                                                 - "OpenMP"
2780
2781      "LanguageVersion" sequence of              - The first integer is the major
2782                        2 integers                 version.
2783                                                 - The second integer is the
2784                                                   minor version.
2785      "Attrs"           mapping                  Mapping of kernel attributes.
2786                                                 See
2787                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2788                                                 for the mapping definition.
2789      "Args"            sequence of              Sequence of mappings of the
2790                        mapping                  kernel arguments. See
2791                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2792                                                 for the definition of the mapping.
2793      "CodeProps"       mapping                  Mapping of properties related to
2794                                                 the kernel code. See
2795                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2796                                                 for the mapping definition.
2797      ================= ============== ========= ================================
2798
2799 ..
2800
2801   .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2802      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2803
2804      =================== ============== ========= ==============================
2805      String Key          Value Type     Required? Description
2806      =================== ============== ========= ==============================
2807      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2808                          3 integers               must be >=1 and the dispatch
2809                                                   work-group size X, Y, Z must
2810                                                   correspond to the specified
2811                                                   values. Defaults to 0, 0, 0.
2812
2813                                                   Corresponds to the OpenCL
2814                                                   ``reqd_work_group_size``
2815                                                   attribute.
2816      "WorkGroupSizeHint" sequence of              The dispatch work-group size
2817                          3 integers               X, Y, Z is likely to be the
2818                                                   specified values.
2819
2820                                                   Corresponds to the OpenCL
2821                                                   ``work_group_size_hint``
2822                                                   attribute.
2823      "VecTypeHint"       string                   The name of a scalar or vector
2824                                                   type.
2825
2826                                                   Corresponds to the OpenCL
2827                                                   ``vec_type_hint`` attribute.
2828
2829      "RuntimeHandle"     string                   The external symbol name
2830                                                   associated with a kernel.
2831                                                   The OpenCL runtime allocates a
2832                                                   global buffer for the symbol
2833                                                   and saves the kernel's address
2834                                                   to it, which is used for
2835                                                   device side enqueueing. Only
2836                                                   available for device side
2837                                                   enqueued kernels.
2838      =================== ============== ========= ==============================
2839
2840 ..
2841
2842   .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2843      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2844
2845      ================= ============== ========= ================================
2846      String Key        Value Type     Required? Description
2847      ================= ============== ========= ================================
2848      "Name"            string                   Kernel argument name.
2849      "TypeName"        string                   Kernel argument type name.
2850      "Size"            integer        Required  Kernel argument size in bytes.
2851      "Align"           integer        Required  Kernel argument alignment in
2852                                                 bytes. Must be a power of two.
2853      "ValueKind"       string         Required  Kernel argument kind that
2854                                                 specifies how to set up the
2855                                                 corresponding argument.
2856                                                 Values include:
2857
2858                                                 "ByValue"
2859                                                   The argument is copied
2860                                                   directly into the kernarg.
2861
2862                                                 "GlobalBuffer"
2863                                                   A global address space pointer
2864                                                   to the buffer data is passed
2865                                                   in the kernarg.
2866
2867                                                 "DynamicSharedPointer"
2868                                                   A group address space pointer
2869                                                   to dynamically allocated LDS
2870                                                   is passed in the kernarg.
2871
2872                                                 "Sampler"
2873                                                   A global address space
2874                                                   pointer to a S# is passed in
2875                                                   the kernarg.
2876
2877                                                 "Image"
2878                                                   A global address space
2879                                                   pointer to a T# is passed in
2880                                                   the kernarg.
2881
2882                                                 "Pipe"
2883                                                   A global address space pointer
2884                                                   to an OpenCL pipe is passed in
2885                                                   the kernarg.
2886
2887                                                 "Queue"
2888                                                   A global address space pointer
2889                                                   to an OpenCL device enqueue
2890                                                   queue is passed in the
2891                                                   kernarg.
2892
2893                                                 "HiddenGlobalOffsetX"
2894                                                   The OpenCL grid dispatch
2895                                                   global offset for the X
2896                                                   dimension is passed in the
2897                                                   kernarg.
2898
2899                                                 "HiddenGlobalOffsetY"
2900                                                   The OpenCL grid dispatch
2901                                                   global offset for the Y
2902                                                   dimension is passed in the
2903                                                   kernarg.
2904
2905                                                 "HiddenGlobalOffsetZ"
2906                                                   The OpenCL grid dispatch
2907                                                   global offset for the Z
2908                                                   dimension is passed in the
2909                                                   kernarg.
2910
2911                                                 "HiddenNone"
2912                                                   An argument that is not used
2913                                                   by the kernel. Space needs to
2914                                                   be left for it, but it does
2915                                                   not need to be set up.
2916
2917                                                 "HiddenPrintfBuffer"
2918                                                   A global address space pointer
2919                                                   to the runtime printf buffer
2920                                                   is passed in the kernarg.
2921                                                   Mutually exclusive with
2922                                                   "HiddenHostcallBuffer".
2923
2924                                                 "HiddenHostcallBuffer"
2925                                                   A global address space pointer
2926                                                   to the runtime hostcall buffer
2927                                                   is passed in the kernarg.
2928                                                   Mutually exclusive with
2929                                                   "HiddenPrintfBuffer".
2930
2931                                                 "HiddenDefaultQueue"
2932                                                   A global address space pointer
2933                                                   to the OpenCL device enqueue
2934                                                   queue that should be used by
2935                                                   the kernel by default is
2936                                                   passed in the kernarg.
2937
2938                                                 "HiddenCompletionAction"
2939                                                   A global address space pointer
2940                                                   to help link enqueued kernels into
2941                                                   the ancestor tree for determining
2942                                                   when the parent kernel has finished.
2943
2944                                                 "HiddenMultiGridSyncArg"
2945                                                   A global address space pointer for
2946                                                   multi-grid synchronization is
2947                                                   passed in the kernarg.
2948
2949      "ValueType"       string                   Unused and deprecated. This should no longer
2950                                                 be emitted, but is accepted for compatibility.
2951
2952
2953      "PointeeAlign"    integer                  Alignment in bytes of pointee
2954                                                 type for pointer type kernel
2955                                                 argument. Must be a power
2956                                                 of 2. Only present if
2957                                                 "ValueKind" is
2958                                                 "DynamicSharedPointer".
2959      "AddrSpaceQual"   string                   Kernel argument address space
2960                                                 qualifier. Only present if
2961                                                 "ValueKind" is "GlobalBuffer" or
2962                                                 "DynamicSharedPointer". Values
2963                                                 are:
2964
2965                                                 - "Private"
2966                                                 - "Global"
2967                                                 - "Constant"
2968                                                 - "Local"
2969                                                 - "Generic"
2970                                                 - "Region"
2971
2972                                                 .. TODO::
2973
2974                                                    Is GlobalBuffer only Global
2975                                                    or Constant? Is
2976                                                    DynamicSharedPointer always
2977                                                    Local? Can HCC allow Generic?
2978                                                    How can Private or Region
2979                                                    ever happen?
2980
2981      "AccQual"         string                   Kernel argument access
2982                                                 qualifier. Only present if
2983                                                 "ValueKind" is "Image" or
2984                                                 "Pipe". Values
2985                                                 are:
2986
2987                                                 - "ReadOnly"
2988                                                 - "WriteOnly"
2989                                                 - "ReadWrite"
2990
2991                                                 .. TODO::
2992
2993                                                    Does this apply to
2994                                                    GlobalBuffer?
2995
2996      "ActualAccQual"   string                   The actual memory accesses
2997                                                 performed by the kernel on the
2998                                                 kernel argument. Only present if
2999                                                 "ValueKind" is "GlobalBuffer",
3000                                                 "Image", or "Pipe". This may be
3001                                                 more restrictive than indicated
3002                                                 by "AccQual" to reflect what the
3003                                                 kernel actually does. If not
3004                                                 present then the runtime must
3005                                                 assume what is implied by
3006                                                 "AccQual" and "IsConst". Values
3007                                                 are:
3008
3009                                                 - "ReadOnly"
3010                                                 - "WriteOnly"
3011                                                 - "ReadWrite"
3012
3013      "IsConst"         boolean                  Indicates if the kernel argument
3014                                                 is const qualified. Only present
3015                                                 if "ValueKind" is
3016                                                 "GlobalBuffer".
3017
3018      "IsRestrict"      boolean                  Indicates if the kernel argument
3019                                                 is restrict qualified. Only
3020                                                 present if "ValueKind" is
3021                                                 "GlobalBuffer".
3022
3023      "IsVolatile"      boolean                  Indicates if the kernel argument
3024                                                 is volatile qualified. Only
3025                                                 present if "ValueKind" is
3026                                                 "GlobalBuffer".
3027
3028      "IsPipe"          boolean                  Indicates if the kernel argument
3029                                                 is pipe qualified. Only present
3030                                                 if "ValueKind" is "Pipe".
3031
3032                                                 .. TODO::
3033
3034                                                    Can GlobalBuffer be pipe
3035                                                    qualified?
3036
3037      ================= ============== ========= ================================
3038
3039 ..
3040
3041   .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
3042      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
3043
3044      ============================ ============== ========= =====================
3045      String Key                   Value Type     Required? Description
3046      ============================ ============== ========= =====================
3047      "KernargSegmentSize"         integer        Required  The size in bytes of
3048                                                            the kernarg segment
3049                                                            that holds the values
3050                                                            of the arguments to
3051                                                            the kernel.
3052      "GroupSegmentFixedSize"      integer        Required  The amount of group
3053                                                            segment memory
3054                                                            required by a
3055                                                            work-group in
3056                                                            bytes. This does not
3057                                                            include any
3058                                                            dynamically allocated
3059                                                            group segment memory
3060                                                            that may be added
3061                                                            when the kernel is
3062                                                            dispatched.
3063      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
3064                                                            private address space
3065                                                            memory required for a
3066                                                            work-item in
3067                                                            bytes. If the kernel
3068                                                            uses a dynamic call
3069                                                            stack then additional
3070                                                            space must be added
3071                                                            to this value for the
3072                                                            call stack.
3073      "KernargSegmentAlign"        integer        Required  The maximum byte
3074                                                            alignment of
3075                                                            arguments in the
3076                                                            kernarg segment. Must
3077                                                            be a power of 2.
3078      "WavefrontSize"              integer        Required  Wavefront size. Must
3079                                                            be a power of 2.
3080      "NumSGPRs"                   integer        Required  Number of scalar
3081                                                            registers used by a
3082                                                            wavefront for
3083                                                            GFX6-GFX11. This
3084                                                            includes the special
3085                                                            SGPRs for VCC, Flat
3086                                                            Scratch (GFX7-GFX10)
3087                                                            and XNACK (for
3088                                                            GFX8-GFX10). It does
3089                                                            not include the 16
3090                                                            SGPRs added if a trap
3091                                                            handler is
3092                                                            enabled. It is not
3093                                                            rounded up to the
3094                                                            allocation
3095                                                            granularity.
3096      "NumVGPRs"                   integer        Required  Number of vector
3097                                                            registers used by
3098                                                            each work-item for
3099                                                            GFX6-GFX11.
3100      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
3101                                                            work-group size
3102                                                            supported by the
3103                                                            kernel in work-items.
3104                                                            Must be >=1 and
3105                                                            consistent with
3106                                                            ReqdWorkGroupSize if
3107                                                            not 0, 0, 0.
3108      "NumSpilledSGPRs"            integer                  Number of stores from
3109                                                            a scalar register to
3110                                                            a register allocator
3111                                                            created spill
3112                                                            location.
3113      "NumSpilledVGPRs"            integer                  Number of stores from
3114                                                            a vector register to
3115                                                            a register allocator
3116                                                            created spill
3117                                                            location.
3118      ============================ ============== ========= =====================
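
The constraints stated in the code properties table lend themselves to a
mechanical check. The following is a minimal sketch, not part of LLVM or of any
runtime: ``check_code_props`` and ``is_power_of_two`` are hypothetical helper
names, and the metadata is assumed to have already been loaded into a plain
Python ``dict``. The phrase "consistent with ReqdWorkGroupSize" is interpreted
here, as an assumption, to mean that the product of the required X, Y, Z sizes
does not exceed "MaxFlatWorkGroupSize".

.. code-block:: python

   # Keys marked "Required" in the code properties table above.
   REQUIRED_KEYS = (
       "KernargSegmentSize", "GroupSegmentFixedSize",
       "PrivateSegmentFixedSize", "KernargSegmentAlign",
       "WavefrontSize", "NumSGPRs", "NumVGPRs", "MaxFlatWorkGroupSize",
   )

   def is_power_of_two(value):
       """Return True if value is a positive power of two."""
       return value > 0 and (value & (value - 1)) == 0

   def check_code_props(code_props, reqd_work_group_size=(0, 0, 0)):
       """Return a list of problems found in a "CodeProps" mapping."""
       problems = []
       for key in REQUIRED_KEYS:
           if key not in code_props:
               problems.append("missing required key: " + key)
       for key in ("KernargSegmentAlign", "WavefrontSize"):
           if key in code_props and not is_power_of_two(code_props[key]):
               problems.append(key + " must be a power of 2")
       max_flat = code_props.get("MaxFlatWorkGroupSize")
       if max_flat is not None:
           if max_flat < 1:
               problems.append("MaxFlatWorkGroupSize must be >= 1")
           x, y, z = reqd_work_group_size
           # Assumption: "consistent" means X * Y * Z fits in the flat size.
           if (x, y, z) != (0, 0, 0) and x * y * z > max_flat:
               problems.append("ReqdWorkGroupSize exceeds MaxFlatWorkGroupSize")
       return problems

An empty list means that none of the checked constraints was violated. It does
not mean that the metadata is otherwise valid.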
3119
3120 .. _amdgpu-amdhsa-code-object-metadata-v3:
3121
3122 Code Object V3 Metadata
3123 +++++++++++++++++++++++
3124
3125 .. warning::
3126   Code object V3 is not the default code object version emitted by this version
3127   of LLVM.
3128
3129 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3130 record (see :ref:`amdgpu-note-records-v3-onwards`).
3131
3132 The metadata is represented as Message Pack formatted binary data (see
3133 [MsgPack]_). The top level is a Message Pack map that includes the
3134 keys defined in table
3135 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3136 tables.
3137
3138 Additional information can be added to the maps. To avoid conflicts,
3139 any key names should be prefixed by "*vendor-name*." where
3140 *vendor-name* can be the name of the vendor and the specific vendor
3141 tool that generates the information. The prefix is abbreviated to
3142 simply "." when it appears within a map that has been added by the
3143 same *vendor-name*.
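
The prefix rule can be illustrated with a short sketch. It assumes the metadata
has been decoded into Python strings, and ``qualify_key`` is a hypothetical
helper that is not part of any LLVM or runtime API. The vendor name ``acme``
is likewise invented for the example. Nothing in the format requires expanded
keys to be materialized; the expansion only shows how an abbreviated key
relates to the *vendor-name* of the map that encloses it.

.. code-block:: python

   def qualify_key(key, enclosing_vendor):
       """Expand an abbreviated ".key" using the enclosing map's vendor name.

       Keys that already carry a "vendor-name." prefix are returned unchanged.
       """
       if key.startswith("."):
           return enclosing_vendor + key
       return key

   # ".name" appears inside a map added by "amdhsa", so it is abbreviated.
   print(qualify_key(".name", "amdhsa"))
   # A key added by another (hypothetical) vendor keeps its full prefix.
   print(qualify_key("acme.tool.note", "amdhsa"))

The first call prints ``amdhsa.name`` and the second prints ``acme.tool.note``.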
3144
3145   .. table:: AMDHSA Code Object V3 Metadata Map
3146      :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3147
3148      ================= ============== ========= =======================================
3149      String Key        Value Type     Required? Description
3150      ================= ============== ========= =======================================
3151      "amdhsa.version"  sequence of    Required  - The first integer is the major
3152                        2 integers                 version. Currently 1.
3153                                                 - The second integer is the minor
3154                                                   version. Currently 0.
3155      "amdhsa.printf"   sequence of              Each string is encoded information
3156                        strings                  about a printf function call. The
3157                                                 encoded information is organized as
3158                                                 fields separated by colon (':'):
3159
3160                                                 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3161
3162                                                 where:
3163
3164                                                 ``ID``
3165                                                   A 32-bit integer as a unique id for
3166                                                   each printf function call.
3167
3168                                                 ``N``
3169                                                   A 32-bit integer equal to the number
3170                                                   of arguments of the printf function
3171                                                   call minus 1.
3172
3173                                                 ``S[i]`` (where i = 0, 1, ... , N-1)
3174                                                   32-bit integers for the size in bytes
3175                                                   of the i-th FormatString argument of
3176                                                   the printf function call.
3177
3178                                                 FormatString
3179                                                   The format string passed to the
3180                                                   printf function call.
3181      "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
3182                        map                      kernel in the code object. See
3183                                                 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3184                                                 for the definition of the keys included
3185                                                 in that map.
3186      ================= ============== ========= =======================================
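
As an illustration of the "amdhsa.printf" encoding described above, the sketch
below splits one such string into its fields. It is a minimal example and not
the parser used by any runtime; ``parse_printf_record`` is a hypothetical name.
It assumes that the integer fields are written in decimal. Only the first
``N + 2`` colons are treated as separators, so that colons inside the format
string are preserved.

.. code-block:: python

   def parse_printf_record(record):
       """Split "ID:N:S[0]:...:S[N-1]:FormatString" into its fields.

       Returns a tuple (printf_id, sizes, format_string).
       """
       # Peel off ID and N; everything after the second colon stays intact.
       id_text, n_text, rest = record.split(":", 2)
       count = int(n_text)  # Assumption: fields are decimal integers.
       if count < 0:
           raise ValueError("N must not be negative")
       # Split off exactly N size fields; the remainder is the format string.
       fields = rest.split(":", count)
       if len(fields) != count + 1:
           raise ValueError("expected %d size fields" % count)
       sizes = [int(text) for text in fields[:count]]
       return int(id_text), sizes, fields[count]

For example, ``parse_printf_record("1:2:4:8:x=%d: y=%ld")`` returns
``(1, [4, 8], "x=%d: y=%ld")``, keeping the colon that is part of the format
string.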
3187
3188 ..
3189
3190   .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3191      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3192
3193      =================================== ============== ========= ================================
3194      String Key                          Value Type     Required? Description
3195      =================================== ============== ========= ================================
3196      ".name"                             string         Required  Source name of the kernel.
3197      ".symbol"                           string         Required  Name of the kernel
3198                                                                   descriptor ELF symbol.
3199      ".language"                         string                   Source language of the kernel.
3200                                                                   Values include:
3201
3202                                                                   - "OpenCL C"
3203                                                                   - "OpenCL C++"
3204                                                                   - "HCC"
3205                                                                   - "HIP"
3206                                                                   - "OpenMP"
3207                                                                   - "Assembler"
3208
3209      ".language_version"                 sequence of              - The first integer is the major
3210                                          2 integers                 version.
3211                                                                   - The second integer is the
3212                                                                     minor version.
3213      ".args"                             sequence of              Sequence of maps of the
3214                                          map                      kernel arguments. See
3215                                                                   :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3216                                                                   for the definition of the keys
3217                                                                   included in that map.
3218      ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3219                                          3 integers               must be >=1 and the dispatch
3220                                                                   work-group size X, Y, Z must
3221                                                                   correspond to the specified
3222                                                                   values. Defaults to 0, 0, 0.
3223
3224                                                                   Corresponds to the OpenCL
3225                                                                   ``reqd_work_group_size``
3226                                                                   attribute.
3227      ".workgroup_size_hint"              sequence of              The dispatch work-group size
3228                                          3 integers               X, Y, Z is likely to be the
3229                                                                   specified values.
3230
3231                                                                   Corresponds to the OpenCL
3232                                                                   ``work_group_size_hint``
3233                                                                   attribute.
3234      ".vec_type_hint"                    string                   The name of a scalar or vector
3235                                                                   type.
3236
3237                                                                   Corresponds to the OpenCL
3238                                                                   ``vec_type_hint`` attribute.
3239
3240      ".device_enqueue_symbol"            string                   The external symbol name
3241                                                                   associated with a kernel.
3242                                                                   The OpenCL runtime allocates a
3243                                                                   global buffer for the symbol
3244                                                                   and saves the kernel's address
3245                                                                   to it, which is used for
3246                                                                   device side enqueueing. Only
3247                                                                   available for device side
3248                                                                   enqueued kernels.
3249      ".kernarg_segment_size"             integer        Required  The size in bytes of
3250                                                                   the kernarg segment
3251                                                                   that holds the values
3252                                                                   of the arguments to
3253                                                                   the kernel.
3254      ".group_segment_fixed_size"         integer        Required  The amount of group
3255                                                                   segment memory
3256                                                                   required by a
3257                                                                   work-group in
3258                                                                   bytes. This does not
3259                                                                   include any
3260                                                                   dynamically allocated
3261                                                                   group segment memory
3262                                                                   that may be added
3263                                                                   when the kernel is
3264                                                                   dispatched.
3265      ".private_segment_fixed_size"       integer        Required  The amount of fixed
3266                                                                   private address space
3267                                                                   memory required for a
3268                                                                   work-item in
3269                                                                   bytes. If the kernel
3270                                                                   uses a dynamic call
3271                                                                   stack then additional
3272                                                                   space must be added
3273                                                                   to this value for the
3274                                                                   call stack.
3275      ".kernarg_segment_align"            integer        Required  The maximum byte
3276                                                                   alignment of
3277                                                                   arguments in the
3278                                                                   kernarg segment. Must
3279                                                                   be a power of 2.
3280      ".wavefront_size"                   integer        Required  Wavefront size. Must
3281                                                                   be a power of 2.
3282      ".sgpr_count"                       integer        Required  Number of scalar
3283                                                                   registers required by a
3284                                                                   wavefront for
3285                                                                   GFX6-GFX9. A register
3286                                                                   is required if it is
3287                                                                   used explicitly, or
3288                                                                   if a higher numbered
3289                                                                   register is used
3290                                                                   explicitly. This
3291                                                                   includes the special
3292                                                                   SGPRs for VCC, Flat
3293                                                                   Scratch (GFX7-GFX9)
3294                                                                   and XNACK (for
3295                                                                   GFX8-GFX9). It does
3296                                                                   not include the 16
3297                                                                   SGPRs added if a trap
3298                                                                   handler is
3299                                                                   enabled. It is not
3300                                                                   rounded up to the
3301                                                                   allocation
3302                                                                   granularity.
3303      ".vgpr_count"                       integer        Required  Number of vector
3304                                                                   registers required by
3305                                                                   each work-item for
3306                                                                   GFX6-GFX9. A register
3307                                                                   is required if it is
3308                                                                   used explicitly, or
3309                                                                   if a higher numbered
3310                                                                   register is used
3311                                                                   explicitly.
3312      ".agpr_count"                       integer        Required  Number of accumulator
3313                                                                   registers required by
3314                                                                   each work-item for
3315                                                                   GFX90A, GFX908.
3316      ".max_flat_workgroup_size"          integer        Required  Maximum flat
3317                                                                   work-group size
3318                                                                   supported by the
3319                                                                   kernel in work-items.
3320                                                                   Must be >=1 and
3321                                                                   consistent with
3322                                                                   ".reqd_workgroup_size"
3323                                                                   if it is not 0, 0, 0.
3324      ".sgpr_spill_count"                 integer                  Number of stores from
3325                                                                   a scalar register to
3326                                                                   a register allocator
3327                                                                   created spill
3328                                                                   location.
3329      ".vgpr_spill_count"                 integer                  Number of stores from
3330                                                                   a vector register to
3331                                                                   a register allocator
3332                                                                   created spill
3333                                                                   location.
3334      ".kind"                             string                   The kind of the kernel
3335                                                                   with the following
3336                                                                   values:
3337
3338                                                                   "normal"
3339                                                                     Regular kernels.
3340
3341                                                                   "init"
3342                                                                     These kernels must be
3343                                                                     invoked after loading
3344                                                                     the containing code
3345                                                                     object and must
3346                                                                     complete before any
3347                                                                     normal and fini
3348                                                                     kernels in the same
3349                                                                     code object are
3350                                                                     invoked.
3351
3352                                                                   "fini"
3353                                                                     These kernels must be
3354                                                                     invoked before
3355                                                                     unloading the
3356                                                                     containing code object
3357                                                                     and after all init and
3358                                                                     normal kernels in the
3359                                                                     same code object have
3360                                                                     been invoked and
3361                                                                     completed.
3362
3363                                                                   If omitted, "normal" is
3364                                                                   assumed.
3365      =================================== ============== ========= ================================
3366
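As an illustration of the constraints listed in the kernel metadata map above,
the following Python sketch checks a decoded kernel metadata map. It is a
minimal sketch and not part of LLVM: the function name
``check_kernel_metadata``, the use of a plain ``dict`` for the decoded map, and
the decision to skip ``.agpr_count`` (assumed here to be emitted only for
targets with accumulator registers) and any keys defined in earlier rows of the
table are assumptions made for this example.

```python
# Hypothetical helper: checks only the rows of the kernel metadata map shown
# above. Keys defined earlier in the table are not checked, and ".agpr_count"
# is skipped on the assumption that it is only emitted for GFX908 and GFX90A.
REQUIRED_KERNEL_KEYS = (
    ".kernarg_segment_size",
    ".group_segment_fixed_size",
    ".private_segment_fixed_size",
    ".kernarg_segment_align",
    ".wavefront_size",
    ".sgpr_count",
    ".vgpr_count",
    ".max_flat_workgroup_size",
)

KERNEL_KINDS = ("normal", "init", "fini")


def is_power_of_two(value):
    """Return True if value is a positive integer power of 2."""
    return isinstance(value, int) and value > 0 and (value & (value - 1)) == 0


def check_kernel_metadata(kernel):
    """Return a list of problems found in a decoded kernel metadata map."""
    problems = []
    for key in REQUIRED_KERNEL_KEYS:
        if key not in kernel:
            problems.append(f"missing required key {key}")
    # Both of these keys are documented as "Must be a power of 2".
    for key in (".kernarg_segment_align", ".wavefront_size"):
        if key in kernel and not is_power_of_two(kernel[key]):
            problems.append(f"{key} must be a power of 2")
    if kernel.get(".max_flat_workgroup_size", 1) < 1:
        problems.append(".max_flat_workgroup_size must be >= 1")
    # If ".kind" is omitted, "normal" is assumed.
    kind = kernel.get(".kind", "normal")
    if kind not in KERNEL_KINDS:
        problems.append(f".kind has unknown value {kind!r}")
    return problems
```

An empty list means that none of the checked constraints were violated; it
does not mean the metadata is valid in every other respect.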
3367 ..
3368
3369   .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3370      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3371
3372      ====================== ============== ========= ================================
3373      String Key             Value Type     Required? Description
3374      ====================== ============== ========= ================================
3375      ".name"                string                   Kernel argument name.
3376      ".type_name"           string                   Kernel argument type name.
3377      ".size"                integer        Required  Kernel argument size in bytes.
3378      ".offset"              integer        Required  Kernel argument offset in
3379                                                      bytes. The offset must be a
3380                                                      multiple of the alignment
3381                                                      required by the argument.
3382      ".value_kind"          string         Required  Kernel argument kind that
3383                                                      specifies how to set up the
3384                                                      corresponding argument.
3385                                                      Values include:
3386
3387                                                      "by_value"
3388                                                        The argument is copied
3389                                                        directly into the kernarg.
3390
3391                                                      "global_buffer"
3392                                                        A global address space pointer
3393                                                        to the buffer data is passed
3394                                                        in the kernarg.
3395
3396                                                      "dynamic_shared_pointer"
3397                                                        A group address space pointer
3398                                                        to dynamically allocated LDS
3399                                                        is passed in the kernarg.
3400
3401                                                      "sampler"
3402                                                        A global address space
3403                                                        pointer to a S# is passed in
3404                                                        the kernarg.
3405
3406                                                      "image"
3407                                                        A global address space
3408                                                        pointer to a T# is passed in
3409                                                        the kernarg.
3410
3411                                                      "pipe"
3412                                                        A global address space pointer
3413                                                        to an OpenCL pipe is passed in
3414                                                        the kernarg.
3415
3416                                                      "queue"
3417                                                        A global address space pointer
3418                                                        to an OpenCL device enqueue
3419                                                        queue is passed in the
3420                                                        kernarg.
3421
3422                                                      "hidden_global_offset_x"
3423                                                        The OpenCL grid dispatch
3424                                                        global offset for the X
3425                                                        dimension is passed in the
3426                                                        kernarg.
3427
3428                                                      "hidden_global_offset_y"
3429                                                        The OpenCL grid dispatch
3430                                                        global offset for the Y
3431                                                        dimension is passed in the
3432                                                        kernarg.
3433
3434                                                      "hidden_global_offset_z"
3435                                                        The OpenCL grid dispatch
3436                                                        global offset for the Z
3437                                                        dimension is passed in the
3438                                                        kernarg.
3439
3440                                                      "hidden_none"
3441                                                        An argument that is not used
3442                                                        by the kernel. Space needs to
3443                                                        be left for it, but it does
3444                                                        not need to be set up.
3445
3446                                                      "hidden_printf_buffer"
3447                                                        A global address space pointer
3448                                                        to the runtime printf buffer
3449                                                        is passed in the kernarg.
3450                                                        Mutually exclusive with
3451                                                        "hidden_hostcall_buffer"
3452                                                        before Code Object V5.
3453
3454                                                      "hidden_hostcall_buffer"
3455                                                        A global address space pointer
3456                                                        to the runtime hostcall buffer
3457                                                        is passed in the kernarg.
3458                                                        Mutually exclusive with
3459                                                        "hidden_printf_buffer"
3460                                                        before Code Object V5.
3461
3462                                                      "hidden_default_queue"
3463                                                        A global address space pointer
3464                                                        to the OpenCL device enqueue
3465                                                        queue that should be used by
3466                                                        the kernel by default is
3467                                                        passed in the kernarg.
3468
3469                                                      "hidden_completion_action"
3470                                                        A global address space pointer is passed
3471                                                        in the kernarg to help link enqueued kernels
3472                                                        into the ancestor tree for determining
3473                                                        when the parent kernel has finished.
3474
3475                                                      "hidden_multigrid_sync_arg"
3476                                                        A global address space pointer for
3477                                                        multi-grid synchronization is
3478                                                        passed in the kernarg.
3479
3480      ".value_type"          string                   Unused and deprecated. This should no longer
3481                                                      be emitted, but is accepted for compatibility.
3482
3483      ".pointee_align"       integer                  Alignment in bytes of pointee
3484                                                      type for pointer type kernel
3485                                                      argument. Must be a power
3486                                                      of 2. Only present if
3487                                                      ".value_kind" is
3488                                                      "dynamic_shared_pointer".
3489      ".address_space"       string                   Kernel argument address space
3490                                                      qualifier. Only present if
3491                                                      ".value_kind" is "global_buffer" or
3492                                                      "dynamic_shared_pointer". Values
3493                                                      are:
3494
3495                                                      - "private"
3496                                                      - "global"
3497                                                      - "constant"
3498                                                      - "local"
3499                                                      - "generic"
3500                                                      - "region"
3501
3502                                                      .. TODO::
3503
3504                                                         Is "global_buffer" only "global"
3505                                                         or "constant"? Is
3506                                                         "dynamic_shared_pointer" always
3507                                                         "local"? Can HCC allow "generic"?
3508                                                         How can "private" or "region"
3509                                                         ever happen?
3510
3511      ".access"              string                   Kernel argument access
3512                                                      qualifier. Only present if
3513                                                      ".value_kind" is "image" or
3514                                                      "pipe". Values
3515                                                      are:
3516
3517                                                      - "read_only"
3518                                                      - "write_only"
3519                                                      - "read_write"
3520
3521                                                      .. TODO::
3522
3523                                                         Does this apply to
3524                                                         "global_buffer"?
3525
3526      ".actual_access"       string                   The actual memory accesses
3527                                                      performed by the kernel on the
3528                                                      kernel argument. Only present if
3529                                                      ".value_kind" is "global_buffer",
3530                                                      "image", or "pipe". This may be
3531                                                      more restrictive than indicated
3532                                                      by ".access" to reflect what the
3533                                                      kernel actually does. If not
3534                                                      present then the runtime must
3535                                                      assume what is implied by
3536                                                      ".access" and ".is_const". Values
3537                                                      are:
3538
3539                                                      - "read_only"
3540                                                      - "write_only"
3541                                                      - "read_write"
3542
3543      ".is_const"            boolean                  Indicates if the kernel argument
3544                                                      is const qualified. Only present
3545                                                      if ".value_kind" is
3546                                                      "global_buffer".
3547
3548      ".is_restrict"         boolean                  Indicates if the kernel argument
3549                                                      is restrict qualified. Only
3550                                                      present if ".value_kind" is
3551                                                      "global_buffer".
3552
3553      ".is_volatile"         boolean                  Indicates if the kernel argument
3554                                                      is volatile qualified. Only
3555                                                      present if ".value_kind" is
3556                                                      "global_buffer".
3557
3558      ".is_pipe"             boolean                  Indicates if the kernel argument
3559                                                      is pipe qualified. Only present
3560                                                      if ".value_kind" is "pipe".
3561
3562                                                      .. TODO::
3563
3564                                                         Can "global_buffer" be pipe
3565                                                         qualified?
3566
3567      ====================== ============== ========= ================================
3568
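Several keys in the kernel argument metadata map above are documented as only
present for particular ``.value_kind`` values. The Python sketch below encodes
those rules as a lookup table. It is an illustrative sketch under stated
assumptions: ``check_kernel_argument`` and ``ALLOWED_VALUE_KINDS`` are
hypothetical names, and the argument is assumed to be already decoded into a
``dict``.

```python
# Hypothetical lookup table: for each optional key, the ".value_kind" values
# for which the table above says the key may be present.
ALLOWED_VALUE_KINDS = {
    ".pointee_align": {"dynamic_shared_pointer"},
    ".address_space": {"global_buffer", "dynamic_shared_pointer"},
    ".access": {"image", "pipe"},
    ".actual_access": {"global_buffer", "image", "pipe"},
    ".is_const": {"global_buffer"},
    ".is_restrict": {"global_buffer"},
    ".is_volatile": {"global_buffer"},
    ".is_pipe": {"pipe"},
}

REQUIRED_ARGUMENT_KEYS = (".size", ".offset", ".value_kind")


def check_kernel_argument(arg):
    """Return a list of problems found in one decoded kernel argument map."""
    problems = []
    for key in REQUIRED_ARGUMENT_KEYS:
        if key not in arg:
            problems.append(f"missing required key {key}")
    kind = arg.get(".value_kind")
    # Report optional keys that appear with a value kind they do not apply to.
    for key, kinds in ALLOWED_VALUE_KINDS.items():
        if key in arg and kind not in kinds:
            problems.append(f"{key} is not expected for value kind {kind!r}")
    return problems
```

The check for ``.offset`` being a multiple of the argument's required
alignment is left out, because the alignment is not itself recorded in this
map.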
3569 .. _amdgpu-amdhsa-code-object-metadata-v4:
3570
3571 Code Object V4 Metadata
3572 +++++++++++++++++++++++
3573
3574 Code object V4 metadata is the same as
3575 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3576 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3577
3578   .. table:: AMDHSA Code Object V4 Metadata Map Changes
3579      :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3580
3581      ================= ============== ========= =======================================
3582      String Key        Value Type     Required? Description
3583      ================= ============== ========= =======================================
3584      "amdhsa.version"  sequence of    Required  - The first integer is the major
3585                        2 integers                 version. Currently 1.
3586                                                 - The second integer is the minor
3587                                                   version. Currently 1.
3588      "amdhsa.target"   string         Required  The target name of the code using the syntax:
3589
3590                                                 .. code::
3591
3592                                                   <target-triple> [ "-" <target-id> ]
3593
3594                                                 A canonical target ID must be
3595                                                 used. See :ref:`amdgpu-target-triples`
3596                                                 and :ref:`amdgpu-target-id`.
3597      ================= ============== ========= =======================================
3598
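The ``amdhsa.target`` syntax above can be illustrated with a short Python
sketch that joins a target triple and an optional target ID. The function name
``format_amdhsa_target`` and the example triple and target ID values are
assumptions chosen for illustration; the sketch does not canonicalize the
target ID, which the table requires the producer to do.

```python
def format_amdhsa_target(target_triple, target_id=None):
    """Build a string of the form <target-triple> [ "-" <target-id> ].

    Hypothetical helper: the caller is assumed to pass an already
    canonical target ID.
    """
    if target_id is None:
        return target_triple
    return f"{target_triple}-{target_id}"


# Example values are assumptions for illustration only.
print(format_amdhsa_target("amdgcn-amd-amdhsa-", "gfx90a:sramecc+:xnack-"))
```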
3599 .. _amdgpu-amdhsa-code-object-metadata-v5:
3600
3601 Code Object V5 Metadata
3602 +++++++++++++++++++++++
3603
3604 .. warning::
3605   Code object V5 is not the default code object version emitted by this version
3606   of LLVM.
3607
3608
3609 Code object V5 metadata is the same as
3610 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3611 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
3612 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
3613 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3614
3615   .. table:: AMDHSA Code Object V5 Metadata Map Changes
3616      :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3617
3618      ================= ============== ========= =======================================
3619      String Key        Value Type     Required? Description
3620      ================= ============== ========= =======================================
3621      "amdhsa.version"  sequence of    Required  - The first integer is the major
3622                        2 integers                 version. Currently 1.
3623                                                 - The second integer is the minor
3624                                                   version. Currently 2.
3625      ================= ============== ========= =======================================
3626
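The ``amdhsa.version`` values in the V4 and V5 metadata map tables can be used
to tell the two metadata versions apart. The following Python sketch shows one
way a consumer might do that. The names are hypothetical, the metadata is
assumed to be already decoded into a ``dict``, and only the two version pairs
stated in those tables are covered.

```python
# Version pairs taken from the V4 and V5 metadata map tables:
# V4 uses major 1, minor 1; V5 uses major 1, minor 2.
METADATA_VERSION_TO_CODE_OBJECT = {
    (1, 1): "V4",
    (1, 2): "V5",
}


def code_object_version(metadata):
    """Return "V4" or "V5" for a decoded metadata map, or None if unknown."""
    version = tuple(metadata["amdhsa.version"])
    return METADATA_VERSION_TO_CODE_OBJECT.get(version)
```

Returning ``None`` for an unrecognized pair leaves it to the caller to decide
whether to reject the code object or fall back to other handling.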
3627 ..
3628
3629   .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
3630      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
3631
3632      ============================= ============= ========== =======================================
3633      String Key                    Value Type     Required? Description
3634      ============================= ============= ========== =======================================
3635      ".uses_dynamic_stack"         boolean                  Indicates if the generated machine code
3636                                                             is using a dynamically sized stack.
3637      ".workgroup_processor_mode"   boolean                  (GFX10+) Controls ENABLE_WGP_MODE in
3638                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3639      ============================= ============= ========== =======================================
3640
3641 ..
3642
3643   .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
3644      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
3645
3646      =========================== ============== ========= ==============================
3647      String Key                  Value Type     Required? Description
3648      =========================== ============== ========= ==============================
3649      ".uniform_work_group_size"  integer                  Indicates if the kernel
3650                                                           requires that each dimension
3651                                                           of global size is a multiple
3652                                                           of corresponding dimension of
3653                                                           work-group size. Value of 1
3654                                                           implies true and value of 0
3655                                                           implies false. Metadata is
3656                                                           only emitted when value is 1.
3657      =========================== ============== ========= ==============================
3658
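The condition described by ``.uniform_work_group_size`` can be written as a
small Python check: every dimension of the global size must be a multiple of
the corresponding work-group size dimension. This is only a sketch of the
arithmetic; the function name ``is_uniform_dispatch`` and the tuple-based
representation of the sizes are assumptions made for this example.

```python
def is_uniform_dispatch(global_size, work_group_size):
    """Return True if each global size dimension is a multiple of the
    corresponding work-group size dimension (hypothetical helper)."""
    return all(g % w == 0 for g, w in zip(global_size, work_group_size))


# 1024 is a multiple of 256 and 64 is a multiple of 8, so no partial
# work-groups are needed in any dimension.
print(is_uniform_dispatch((1024, 64, 1), (256, 8, 1)))
# 1000 is not a multiple of 256, so the last work-group in X is partial.
print(is_uniform_dispatch((1000, 1, 1), (256, 1, 1)))
```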
3661 ..
3662
3663   .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3664      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3665
3666      ====================== ============== ========= ================================
3667      String Key             Value Type     Required? Description
3668      ====================== ============== ========= ================================
3669      ".value_kind"          string         Required  Kernel argument kind that
3670                                                      specifies how to set up the
3671                                                      corresponding argument.
3672                                                      Values are the same as in
3673                                                      code object V3 metadata
3674                                                      (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
3675                                                      with the following additions:
3676
3677                                                      "hidden_block_count_x"
3678                                                        The grid dispatch work-group count for the X dimension
3679                                                        is passed in the kernarg. Some languages, such as OpenCL,
3680                                                        support a last work-group in each dimension being partial.
3681                                                        This count only includes the non-partial work-groups.
3682                                                        This is not the same as the value in the AQL dispatch packet,
3683                                                        which has the grid size in work-items.
3684
3685                                                      "hidden_block_count_y"
3686                                                        The grid dispatch work-group count for the Y dimension
3687                                                        is passed in the kernarg. Some languages, such as OpenCL,
3688                                                        support a last work-group in each dimension being partial.
3689                                                        This count only includes the non-partial work-groups.
3690                                                        This is not the same as the value in the AQL dispatch packet,
3691                                                        which has the grid size in work-items. If the grid dimensionality
3692                                                        is 1, then this must be 1.
3693
3694                                                      "hidden_block_count_z"
3695                                                        The grid dispatch work-group count for the Z dimension
3696                                                        is passed in the kernarg. Some languages, such as OpenCL,
3697                                                        support a last work-group in each dimension being partial.
3698                                                        This count only includes the non-partial work-group count.
3699                                                        This is not the same as the value in the AQL dispatch packet,
3700                                                        which has the grid size in work-items. If the grid dimensionality
3701                                                        is 1 or 2, then this value must be 1.
3702
3703                                                      "hidden_group_size_x"
3704                                                        The grid dispatch work-group size for the X dimension is
3705                                                        passed in the kernarg. This size only applies to the
3706                                                        non-partial work-groups. This is the same value as the AQL
3707                                                        dispatch packet work-group size.
3708
3709                                                      "hidden_group_size_y"
3710                                                        The grid dispatch work-group size for the Y dimension is
3711                                                        passed in the kernarg. This size only applies to the
3712                                                        non-partial work-groups. This is the same value as the AQL
3713                                                        dispatch packet work-group size. If the grid dimensionality
3714                                                        is 1, then this value must be 1.
3715
3716                                                      "hidden_group_size_z"
3717                                                        The grid dispatch work-group size for the Z dimension is
3718                                                        passed in the kernarg. This size only applies to the
3719                                                        non-partial work-groups. This is the same value as the AQL
3720                                                        dispatch packet work-group size. If the grid dimensionality
3721                                                        is 1 or 2, then this value must be 1.
3722
3723                                                      "hidden_remainder_x"
3724                                                        The grid dispatch work-group size of the partial work-group
3725                                                        of the X dimension, if it exists. Must be zero if a partial
3726                                                        work-group does not exist in the X dimension.
3727
3728                                                      "hidden_remainder_y"
3729                                                        The grid dispatch work-group size of the partial work-group
3730                                                        of the Y dimension, if it exists. Must be zero if a partial
3731                                                        work-group does not exist in the Y dimension.
3732
3733                                                      "hidden_remainder_z"
3734                                                        The grid dispatch work-group size of the partial work-group
3735                                                        of the Z dimension, if it exists. Must be zero if a partial
3736                                                        work-group does not exist in the Z dimension.
3737
3738                                                      "hidden_grid_dims"
3739                                                        The grid dispatch dimensionality. This is the same value
3740                                                        as the AQL dispatch packet dimensionality. Must be a value
3741                                                        between 1 and 3.
3742
3743                                                      "hidden_heap_v1"
3744                                                        A global address space pointer to an initialized memory
3745                                                        buffer that conforms to the requirements of the malloc/free
3746                                                        device library V1 version implementation.
3747
3748                                                      "hidden_private_base"
3749                                                        The high 32 bits of the flat addressing private aperture base.
3750                                                        Only used by GFX8 to allow conversion between private segment
3751                                                        and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3752
3753                                                      "hidden_shared_base"
3754                                                        The high 32 bits of the flat addressing shared aperture base.
3755                                                        Only used by GFX8 to allow conversion between shared segment
3756                                                        and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3757
3758                                                      "hidden_queue_ptr"
3759                                                        A global memory address space pointer to the ROCm runtime
3760                                                        ``struct amd_queue_t`` structure for the HSA queue of the
3761                                                        associated dispatch AQL packet. It is only required for pre-GFX9
3762                                                        devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
3763
3764      ====================== ============== ========= ================================
3765
3766 ..
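
The relationship between the ``hidden_block_count_*``, ``hidden_group_size_*``
and ``hidden_remainder_*`` values described above can be illustrated with a
small sketch. The function name ``hidden_dispatch_args`` and its tuple arguments
are hypothetical and are not part of any runtime API. The sketch assumes that
the grid size is given in work-items, as in the AQL dispatch packet, and that
unused dimensions have a grid size and work-group size of 1.

```python
def hidden_dispatch_args(grid_size, group_size):
    """Derive per-dimension hidden kernarg values for one dispatch.

    grid_size:  grid size in work-items per dimension, e.g. (1000, 4, 1)
    group_size: work-group size per dimension, e.g. (256, 4, 1)
    """
    args = {}
    for axis, grid, group in zip("xyz", grid_size, group_size):
        # Only complete (non-partial) work-groups are counted.
        args[f"hidden_block_count_{axis}"] = grid // group
        # Size of the non-partial work-groups.
        args[f"hidden_group_size_{axis}"] = group
        # Size of the trailing partial work-group, or 0 if there is none.
        args[f"hidden_remainder_{axis}"] = grid % group
    return args
```

For a grid of ``(1000, 4, 1)`` work-items and a work-group size of
``(256, 4, 1)``, the X dimension has 3 complete work-groups and a partial
work-group of 232 work-items, while the Y and Z dimensions each have a block
count of 1 and a remainder of 0.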
3767
3768 Kernel Dispatch
3769 ~~~~~~~~~~~~~~~
3770
3771 The HSA architected queuing language (AQL) defines a user space memory interface
3772 that can be used to control the dispatch of kernels in an agent-independent
3773 way. An agent can have zero or more AQL queues created for it using an HSA
3774 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3775 are 64 bytes) can be placed. See the *HSA Platform System Architecture
3776 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3777
3778 The packet processor of a kernel agent is responsible for detecting and
3779 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3780 packet processor is implemented by the hardware command processor (CP),
3781 asynchronous dispatch controller (ADC) and shader processor input controller
3782 (SPI).
3783
3784 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3785 the kernel mode driver to initialize and register the AQL queue with CP.
3786
3787 To dispatch a kernel the following actions are performed. This can occur in the
3788 CPU host program, or from an HSA kernel executing on a GPU.
3789
3790 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3791    executed is obtained.
3792 2. A pointer to the kernel descriptor (see
3793    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3794    It must be for a kernel that is contained in a code object that was loaded
3795    by an HSA compatible runtime on the kernel agent with which the AQL queue is
3796    associated.
3797 3. Space is allocated for the kernel arguments using the HSA compatible runtime
3798    allocator for a memory region with the kernarg property for the kernel agent
3799    that will execute the kernel. It must be at least 16-byte aligned.
3800 4. Kernel argument values are assigned to the kernel argument memory
3801    allocation. The layout is defined in the *HSA Programmer's Language
3802    Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3803    kernel argument memory in the same way constant memory is accessed. (Note
3804    that the HSA specification allows an implementation to copy the kernel
3805    argument contents to another location that is accessed by the kernel.)
3806 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3807    runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3808    for the packet. The packet must be set up, and the final write must use an
3809    atomic store release to set the packet kind to ensure the packet contents are
3810    visible to the kernel agent. AQL defines a doorbell signal mechanism to
3811    notify the kernel agent that the AQL queue has been updated. These rules, and
3812    the layout of the AQL queue and kernel dispatch packet, are defined in the
3813    *HSA Platform System Architecture Specification* [HSA]_.
3814 6. A kernel dispatch packet includes information about the actual dispatch,
3815    such as grid and work-group size, together with information from the code
3816    object about the kernel, such as segment sizes. The HSA compatible runtime
3817    queries on the kernel symbol can be used to obtain the code object values
3818    which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
3819 7. CP executes micro-code and is responsible for detecting and setting up the
3820    GPU to execute the wavefronts of a kernel dispatch.
3821 8. CP ensures that when a wavefront starts executing the kernel machine
3822    code, the scalar general purpose registers (SGPR) and vector general purpose
3823    registers (VGPR) are set up as required by the machine code. The required
3824    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3825    register state is defined in
3826    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
3827 9. The prolog of the kernel machine code (see
3828    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3829    before continuing to execute the machine code that corresponds to the kernel.
3830 10. When the kernel dispatch has completed execution, CP signals the completion
3831     signal specified in the kernel dispatch packet if not 0.
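
The ordering required by step 5 (reserve a slot, fill in the packet, publish the
packet kind last, then ring the doorbell) can be modeled with a toy queue. This
is only a sketch of the ordering, not an implementation of AQL: the class
``ToyAqlQueue``, the placement of the packet kind in the first two bytes of the
slot, and the numeric marker values are all assumptions made for illustration.
A lock stands in for the 64-bit atomic operations, and the check for a full
queue is omitted.

```python
import struct
import threading

PACKET_SIZE = 64            # all AQL packets are 64 bytes
PACKET_INVALID = 1          # illustrative marker: slot not ready
PACKET_KERNEL_DISPATCH = 2  # illustrative marker: kernel dispatch packet


class ToyAqlQueue:
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.ring = bytearray(num_slots * PACKET_SIZE)
        # Every slot starts out marked as not ready.
        for slot in range(num_slots):
            struct.pack_into("<H", self.ring, slot * PACKET_SIZE, PACKET_INVALID)
        self.write_index = 0
        self.doorbell = None
        self._lock = threading.Lock()  # stands in for 64-bit atomics

    def reserve(self):
        # Models an atomic fetch-and-add on the queue write index.
        with self._lock:
            index = self.write_index
            self.write_index += 1
        return index

    def submit(self, body):
        assert len(body) == PACKET_SIZE - 2
        index = self.reserve()
        base = (index % self.num_slots) * PACKET_SIZE
        # 1. Fill in everything except the packet kind.
        self.ring[base + 2:base + PACKET_SIZE] = body
        # 2. Publish the packet kind last (a store release on real hardware).
        struct.pack_into("<H", self.ring, base, PACKET_KERNEL_DISPATCH)
        # 3. Notify the packet processor that the queue has been updated.
        self.doorbell = index
        return index

    def kind(self, index):
        base = (index % self.num_slots) * PACKET_SIZE
        return struct.unpack_from("<H", self.ring, base)[0]
```

Writing the packet kind last is what allows the packet processor to treat any
slot whose kind is still "not ready" as incomplete, regardless of how much of
the body has been written.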
3832
3833 .. _amdgpu-amdhsa-memory-spaces:
3834
3835 Memory Spaces
3836 ~~~~~~~~~~~~~
3837
3838 The memory space properties are:
3839
3840   .. table:: AMDHSA Memory Spaces
3841      :name: amdgpu-amdhsa-memory-spaces-table
3842
3843      ================= =========== ======== ======= ==================
3844      Memory Space Name HSA Segment Hardware Address NULL Value
3845                        Name        Name     Size
3846      ================= =========== ======== ======= ==================
3847      Private           private     scratch  32      0x00000000
3848      Local             group       LDS      32      0xFFFFFFFF
3849      Global            global      global   64      0x0000000000000000
3850      Constant          constant    *same as 64      0x0000000000000000
3851                                    global*
3852      Generic           flat        flat     64      0x0000000000000000
3853      Region            N/A         GDS      32      *not implemented
3854                                                     for AMDHSA*
3855      ================= =========== ======== ======= ==================
3856
3857 The global and constant memory spaces both use global virtual addresses, which
3858 are the same virtual address space used by the CPU. However, some virtual
3859 addresses may only be accessible to the CPU, some only accessible by the GPU,
3860 and some by both.
3861
3862 Using the constant memory space indicates that the data will not change during
3863 the execution of the kernel. This allows scalar read instructions to be
3864 used. The vector and scalar L1 caches are invalidated of volatile data before
3865 each kernel dispatch execution to allow constant memory to change values between
3866 kernel dispatches.
3867
3868 The local memory space uses the hardware Local Data Store (LDS) which is
3869 automatically allocated when the hardware creates work-groups of wavefronts, and
3870 freed when all the wavefronts of a work-group have terminated. The data store
3871 (DS) instructions can be used to access it.
3872
3873 The private memory space uses the hardware scratch memory support. If the kernel
3874 uses scratch, then the hardware allocates memory that is accessed using
3875 wavefront lane dword (4 byte) interleaving. The mapping used from private
3876 address to physical address is:
3877
3878   ``wavefront-scratch-base +
3879   (private-address * wavefront-size * 4) +
3880   (wavefront-lane-id * 4)``
3881
3882 There are different ways that the wavefront scratch base address is determined
3883 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3884 memory can be accessed in an interleaved manner using buffer instructions with
3885 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3886 instructions, or by flat instructions. If each lane of a wavefront accesses the
3887 same private address, the interleaving results in adjacent dwords being accessed
3888 and hence requires fewer cache lines to be fetched. Multi-dword access is not
3889 supported except by flat and scratch instructions in GFX9-GFX11.
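
The mapping above can be written out as a short sketch. The function name
``scratch_address`` is hypothetical, and the sketch applies the formula exactly
as written, taking ``private_address`` in whatever unit the formula intends. It
is meant only to show why lanes that access the same private address touch
adjacent dwords.

```python
def scratch_address(wavefront_scratch_base, private_address, lane_id,
                    wavefront_size=64):
    """Apply the private-to-physical mapping given in the text."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + lane_id * 4)


# Lanes 0..3 of one wavefront accessing the same private address
# map to four consecutive dwords.
addresses = [scratch_address(0x1000, 2, lane) for lane in range(4)]
```

With a wavefront size of 64, consecutive private addresses are 256 bytes apart
in this mapping, and the lanes of a wavefront fill the space in between.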
3890
3891 The generic address space uses the hardware flat address support available in
3892 GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
3893 local apertures), that are outside the range of addressable global memory, to
3894 map from a flat address to a private or local address.
3895
3896 FLAT instructions can take a flat address and access global, private (scratch)
3897 and group (LDS) memory depending on whether the address is within one of the
3898 aperture ranges. Flat access to scratch requires hardware aperture setup and
3899 setup in the kernel prologue (see
3900 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3901 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3902 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3903
3904 To convert between a segment address and a flat address the base addresses of
3905 the apertures can be used. For GFX7-GFX8 these are available in the
3906 :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
3907 the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3908 GFX9-GFX11 the aperture base addresses are directly available as inline constant
3909 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3910 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3911 which makes it easier to convert from flat to segment or segment to flat.
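
Because each aperture is 2^32 bytes in size and aligned to 2^32, the conversion
reduces to operations on the upper and lower 32 bits of the address. The
following is a minimal sketch under that assumption. The function names and the
aperture base value are hypothetical, and the sketch deliberately ignores NULL
handling: the segment and flat NULL values differ (see
:ref:`amdgpu-amdhsa-memory-spaces-table`), so a real conversion has to treat
NULL separately.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside its aperture."""
    assert aperture_base % APERTURE_SIZE == 0
    assert 0 <= segment_address < APERTURE_SIZE
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Return the 32-bit segment address, or None if outside the aperture."""
    if flat_address >> 32 != aperture_base >> 32:
        return None
    return flat_address & 0xFFFFFFFF


# Hypothetical aperture base used only for illustration.
EXAMPLE_BASE = 1 << 48
flat = segment_to_flat(EXAMPLE_BASE, 0x1234)
```

The alignment is what makes the bitwise OR and the mask sufficient. With an
unaligned base, the conversion would need an addition and a range check against
the aperture limit instead.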
3912
3913 Image and Samplers
3914 ~~~~~~~~~~~~~~~~~~
3915
3916 Image and sampler handles created by an HSA compatible runtime (see
3917 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3918 object respectively. In order to support the HSA ``query_sampler`` operations
3919 two extra dwords are used to store the HSA BRIG enumeration values for the
3920 queries that are not trivially deducible from the S# representation.
3921
3922 HSA Signals
3923 ~~~~~~~~~~~
3924
3925 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3926 are 64-bit addresses of a structure allocated in memory accessible from both the
3927 CPU and GPU. The structure is defined by the runtime and subject to change
3928 between releases. For example, see [AMD-ROCm-github]_.
3929
3930 .. _amdgpu-amdhsa-hsa-aql-queue:
3931
3932 HSA AQL Queue
3933 ~~~~~~~~~~~~~
3934
3935 The HSA AQL queue structure is defined by an HSA compatible runtime (see
3936 :ref:`amdgpu-os`) and subject to change between releases. For example, see
3937 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
3938 certain language features such as the flat address aperture bases. It also
3939 contains fields used by CP such as managing the allocation of scratch memory.
3940
3941 .. _amdgpu-amdhsa-kernel-descriptor:
3942
3943 Kernel Descriptor
3944 ~~~~~~~~~~~~~~~~~
3945
3946 A kernel descriptor consists of the information needed by CP to initiate the
3947 execution of a kernel, including the entry point address of the machine code
3948 that implements the kernel.
3949
3950 Code Object V3 Kernel Descriptor
3951 ++++++++++++++++++++++++++++++++
3952
3953 CP microcode requires the kernel descriptor to be allocated on 64-byte
3954 alignment.
3955
3956 The fields used by CP for code objects before V3 also match those specified in
3957 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3958
3959   .. table:: Code Object V3 Kernel Descriptor
3960      :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3961
3962      ======= ======= =============================== ============================
3963      Bits    Size    Field Name                      Description
3964      ======= ======= =============================== ============================
3965      31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3966                                                      address space memory
3967                                                      required for a work-group
3968                                                      in bytes. This does not
3969                                                      include any dynamically
3970                                                      allocated local address
3971                                                      space memory that may be
3972                                                      added when the kernel is
3973                                                      dispatched.
3974      63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3975                                                      private address space
3976                                                      memory required for a
3977                                                      work-item in bytes.  When
3978                                                      this cannot be predicted,
3979                                                      code object v4 and older
3980                                                      sets this value to be
3981                                                      higher than the minimum
3982                                                      requirement.
3983      95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3984                                                      memory pointed to by the
3985                                                      AQL dispatch packet. The
3986                                                      kernarg memory is used to
3987                                                      pass arguments to the
3988                                                      kernel.
3989
3990                                                      * If the kernarg pointer in
3991                                                        the dispatch packet is NULL
3992                                                        then there are no kernel
3993                                                        arguments.
3994                                                      * If the kernarg pointer in
3995                                                        the dispatch packet is
3996                                                        not NULL and this value
3997                                                        is 0 then the kernarg
3998                                                        memory size is
3999                                                        unspecified.
4000                                                      * If the kernarg pointer in
4001                                                        the dispatch packet is
4002                                                        not NULL and this value
4003                                                        is not 0 then the value
4004                                                        specifies the kernarg
4005                                                        memory size in bytes. It
4006                                                        is recommended to provide
4007                                                        a value as it may be used
4008                                                        by CP to optimize making
4009                                                        the kernarg memory
4010                                                        visible to the kernel
4011                                                        code.
4012
4013      127:96  4 bytes                                 Reserved, must be 0.
4014      191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
4015                                                      negative) from base
4016                                                      address of kernel
4017                                                      descriptor to kernel's
4018                                                      entry point instruction
4019                                                      which must be 256 byte
4020                                                      aligned.
4021      351:192 20                                      Reserved, must be 0.
4022              bytes
4023      383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
4024                                                        Reserved, must be 0.
4025                                                      GFX90A, GFX940
4026                                                        Compute Shader (CS)
4027                                                        program settings used by
4028                                                        CP to set up
4029                                                        ``COMPUTE_PGM_RSRC3``
4030                                                        configuration
4031                                                        register. See
4032                                                        :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
4033                                                      GFX10-GFX11
4034                                                        Compute Shader (CS)
4035                                                        program settings used by
4036                                                        CP to set up
4037                                                        ``COMPUTE_PGM_RSRC3``
4038                                                        configuration
4039                                                        register. See
4040                                                        :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
4041      415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
4042                                                      program settings used by
4043                                                      CP to set up
4044                                                      ``COMPUTE_PGM_RSRC1``
4045                                                      configuration
4046                                                      register. See
4047                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
4048      447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
4049                                                      program settings used by
4050                                                      CP to set up
4051                                                      ``COMPUTE_PGM_RSRC2``
4052                                                      configuration
4053                                                      register. See
4054                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
4055      454:448 7 bits  *See separate bits below.*      Enable the setup of the
4056                                                      SGPR user data registers
4057                                                      (see
4058                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4059
4060                                                      The total number of SGPR
4061                                                      user data registers
4062                                                      requested must not exceed
4063                                                      16 and must match the value in
4064                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
4065                                                      Any requests beyond 16
4066                                                      will be ignored.
4067      >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
4068                      _BUFFER                         column of
4069                                                      :ref:`amdgpu-processor-table`
4070                                                      specifies *Architected flat
4071                                                      scratch* then not supported
4072                                                      and must be 0.
4073      >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
4074      >450    1 bit   ENABLE_SGPR_QUEUE_PTR
4075      >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
4076      >452    1 bit   ENABLE_SGPR_DISPATCH_ID
4077      >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
4078                                                      column of
4079                                                      :ref:`amdgpu-processor-table`
4080                                                      specifies *Architected flat
4081                                                      scratch* then not supported
4082                                                      and must be 0.
4083      >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
4084                      _SIZE
4085      457:455 3 bits                                  Reserved, must be 0.
4086      458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
4087                                                        Reserved, must be 0.
4088                                                      GFX10-GFX11
4089                                                        - If 0 execute in
4090                                                          wavefront size 64 mode.
4091                                                        - If 1 execute in
4092                                                          native wavefront size
4093                                                          32 mode.
4094      459     1 bit   USES_DYNAMIC_STACK              Indicates if the generated
4095                                                      machine code is using a
4096                                                      dynamically sized stack.
4097                                                      This is only set in code
4098                                                      object v5 and later.
4099      463:460 4 bits                                  Reserved, must be 0.
4100      464     1 bit   RESERVED_464                    Deprecated, must be 0.
4101      467:465 3 bits                                  Reserved, must be 0.
4102      468     1 bit   RESERVED_468                    Deprecated, must be 0.
4103      471:469 3 bits                                  Reserved, must be 0.
4104      511:472 5 bytes                                 Reserved, must be 0.
4105      512     **Total size 64 bytes.**
4106      ======= ====================================================================
4107
4108 ..
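
As an illustration of the layout above, the 64-byte descriptor can be packed
with Python's ``struct`` module. This is a sketch under stated assumptions
rather than a description of how any tool emits the descriptor: it assumes
little-endian byte order, treats bit 448 as the least significant bit of a
16-bit word at byte offset 56, and uses the hypothetical names ``KD_FORMAT``,
``PROPERTY_BITS`` and ``pack_kernel_descriptor``.

```python
import struct

# 4 x u32, i64 entry offset, 20 reserved bytes, RSRC3, RSRC1, RSRC2,
# 16 bits of kernel code properties, 6 reserved bytes.
KD_FORMAT = "<IIIIq20xIIIH6x"

# Bit positions relative to bit 448 of the descriptor.
PROPERTY_BITS = {
    "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,
    "ENABLE_SGPR_DISPATCH_PTR": 1,
    "ENABLE_SGPR_QUEUE_PTR": 2,
    "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
    "ENABLE_SGPR_DISPATCH_ID": 4,
    "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
    "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,
    "ENABLE_WAVEFRONT_SIZE32": 10,
    "USES_DYNAMIC_STACK": 11,
}


def pack_kernel_descriptor(group_size, private_size, kernarg_size,
                           entry_offset, rsrc1, rsrc2, rsrc3=0,
                           properties=()):
    """Pack the descriptor fields into 64 bytes; reserved fields are 0."""
    flags = 0
    for name in properties:
        flags |= 1 << PROPERTY_BITS[name]
    return struct.pack(KD_FORMAT, group_size, private_size, kernarg_size,
                       0, entry_offset, rsrc3, rsrc1, rsrc2, flags)


descriptor = pack_kernel_descriptor(
    0, 0, 16, 256, rsrc1=0, rsrc2=0,
    properties=("ENABLE_SGPR_KERNARG_SEGMENT_PTR", "ENABLE_WAVEFRONT_SIZE32"))
```

The format string places ``COMPUTE_PGM_RSRC3`` at byte offset 44,
``COMPUTE_PGM_RSRC1`` at 48 and ``COMPUTE_PGM_RSRC2`` at 52, which matches the
bit ranges 383:352, 415:384 and 447:416 in the table.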
4109
4110   .. table:: compute_pgm_rsrc1 for GFX6-GFX11
4111      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table
4112
4113      ======= ======= =============================== ===========================================================================
4114      Bits    Size    Field Name                      Description
4115      ======= ======= =============================== ===========================================================================
4116      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
4117                                                      blocks used by each work-item;
4118                                                      granularity is device
4119                                                      specific:
4120
4121                                                      GFX6-GFX9
4122                                                        - vgprs_used 0..256
4123                                                        - max(0, ceil(vgprs_used / 4) - 1)
4124                                                      GFX90A, GFX940
4125                                                        - vgprs_used 0..512
4126                                                        - vgprs_used = align(arch_vgprs, 4)
4127                                                                       + acc_vgprs
4128                                                        - max(0, ceil(vgprs_used / 8) - 1)
4129                                                      GFX10-GFX11 (wavefront size 64)
4130                                                        - vgprs_used 1..256
4131                                                        - max(0, ceil(vgprs_used / 4) - 1)
4132                                                      GFX10-GFX11 (wavefront size 32)
4133                                                        - vgprs_used 1..256
4134                                                        - max(0, ceil(vgprs_used / 8) - 1)
4135
4136                                                      Where vgprs_used is defined
4137                                                      as the highest VGPR number
4138                                                      explicitly referenced plus
4139                                                      one.
4140
4141                                                      Used by CP to set up
4142                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
4143
4144                                                      The
4145                                                      :ref:`amdgpu-assembler`
4146                                                      calculates this
4147                                                      automatically for the
4148                                                      selected processor from
4149                                                      values provided to the
4150                                                      ``.amdhsa_kernel`` directive
4151                                                      by the
4152                                                      ``.amdhsa_next_free_vgpr``
4153                                                      nested directive (see
4154                                                      :ref:`amdhsa-kernel-directives-table`).
4155      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4156                                                      blocks used by a wavefront;
4157                                                      granularity is device
4158                                                      specific:
4159
4160                                                      GFX6-GFX8
4161                                                        - sgprs_used 0..112
4162                                                        - max(0, ceil(sgprs_used / 8) - 1)
4163                                                      GFX9
4164                                                        - sgprs_used 0..112
4165                                                        - 2 * max(0, ceil(sgprs_used / 16) - 1)
4166                                                      GFX10-GFX11
4167                                                        Reserved, must be 0.
4168                                                        (128 SGPRs always
4169                                                        allocated.)
4170
4171                                                      Where sgprs_used is
4172                                                      defined as the highest
4173                                                      SGPR number explicitly
4174                                                      referenced plus one, plus
4175                                                      a target specific number
4176                                                      of additional special
4177                                                      SGPRs for VCC,
4178                                                      FLAT_SCRATCH (GFX7+) and
4179                                                      XNACK_MASK (GFX8+), and
4180                                                      any additional
4181                                                      target specific
4182                                                      limitations. It does not
4183                                                      include the 16 SGPRs added
4184                                                      if a trap handler is
4185                                                      enabled.
4186
4187                                                      The target specific
4188                                                      limitations and special
4189                                                      SGPR layout are defined in
4190                                                      the hardware
4191                                                      documentation, which can
4192                                                      be found in the
4193                                                      :ref:`amdgpu-processors`
4194                                                      table.
4195
4196                                                      Used by CP to set up
4197                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
4198
4199                                                      The
4200                                                      :ref:`amdgpu-assembler`
4201                                                      calculates this
4202                                                      automatically for the
4203                                                      selected processor from
4204                                                      values provided to the
4205                                                      ``.amdhsa_kernel`` directive
4206                                                      by the
4207                                                      ``.amdhsa_next_free_sgpr``
4208                                                      and ``.amdhsa_reserve_*``
4209                                                      nested directives (see
4210                                                      :ref:`amdhsa-kernel-directives-table`).
4211      11:10   2 bits  PRIORITY                        Must be 0.
4212
4213                                                      Start executing wavefront
4214                                                      at the specified priority.
4215
4216                                                      CP is responsible for
4217                                                      filling in
4218                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
4219      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
4220                                                      with specified rounding
4221                                                      mode for single
4222                                                      precision (32-bit)
4223                                                      floating point
4224                                                      operations.
4225
4226                                                      Floating point rounding
4227                                                      mode values are defined in
4228                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4229
4230                                                      Used by CP to set up
4231                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4232      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
4233                                                      with specified rounding
4234                                                      mode for half/double
4235                                                      precision (16-bit and
4236                                                      64-bit) floating point
4237                                                      operations.
4238
4239                                                      Floating point rounding
4240                                                      mode values are defined in
4241                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4242
4243                                                      Used by CP to set up
4244                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4245      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
4246                                                      with specified denorm mode
4247                                                      for single
4248                                                      precision (32-bit)
4249                                                      floating point
4250                                                      operations.
4251
4252                                                      Floating point denorm mode
4253                                                      values are defined in
4254                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4255
4256                                                      Used by CP to set up
4257                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4258      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
4259                                                      with specified denorm mode
4260                                                      for half/double
4261                                                      precision (16-bit and
4262                                                      64-bit) floating point
4263                                                      operations.
4264
4265                                                      Floating point denorm mode
4266                                                      values are defined in
4267                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4268
4269                                                      Used by CP to set up
4270                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4271      20      1 bit   PRIV                            Must be 0.
4272
4273                                                      Start executing wavefront
4274                                                      in privileged trap handler
4275                                                      mode.
4276
4277                                                      CP is responsible for
4278                                                      filling in
4279                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
4280      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
4281                                                      with DX10 clamp mode
4282                                                      enabled. Used by the vector
4283                                                      ALU to force DX10 style
4284                                                      treatment of NaNs (when
4285                                                      set, clamp NaN to zero,
4286                                                      otherwise pass NaN
4287                                                      through).
4288
4289                                                      Used by CP to set up
4290                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4291      22      1 bit   DEBUG_MODE                      Must be 0.
4292
4293                                                      Start executing wavefront
4294                                                      in single step mode.
4295
4296                                                      CP is responsible for
4297                                                      filling in
4298                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4299      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
4300                                                      with IEEE mode
4301                                                      enabled. Floating point
4302                                                      opcodes that support
4303                                                      exception flag gathering
4304                                                      will quiet and propagate
4305                                                      signaling-NaN inputs per
4306                                                      IEEE 754-2008. Min_dx10 and
4307                                                      max_dx10 become IEEE
4308                                                      754-2008 compliant due to
4309                                                      signaling-NaN propagation
4310                                                      and quieting.
4311
4312                                                      Used by CP to set up
4313                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4314      24      1 bit   BULKY                           Must be 0.
4315
4316                                                      Only one work-group allowed
4317                                                      to execute on a compute
4318                                                      unit.
4319
4320                                                      CP is responsible for
4321                                                      filling in
4322                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
4323      25      1 bit   CDBG_USER                       Must be 0.
4324
4325                                                      Flag that can be used to
4326                                                      control debugging code.
4327
4328                                                      CP is responsible for
4329                                                      filling in
4330                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4331      26      1 bit   FP16_OVFL                       GFX6-GFX8
4332                                                        Reserved, must be 0.
4333                                                      GFX9-GFX11
4334                                                        Wavefront starts execution
4335                                                        with specified fp16 overflow
4336                                                        mode.
4337
4338                                                        - If 0, fp16 overflow generates
4339                                                          +/-INF values.
4340                                                        - If 1, fp16 overflow that is the
4341                                                          result of a +/-INF input value
4342                                                          or divide by 0 produces a +/-INF,
4343                                                          otherwise clamps computed
4344                                                          overflow to +/-MAX_FP16 as
4345                                                          appropriate.
4346
4347                                                        Used by CP to set up
4348                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4349      28:27   2 bits                                  Reserved, must be 0.
4350      29      1 bit   WGP_MODE                        GFX6-GFX9
4351                                                        Reserved, must be 0.
4352                                                      GFX10-GFX11
4353                                                        - If 0 execute work-groups in
4354                                                          CU wavefront execution mode.
4355                                                        - If 1 execute work-groups in
4356                                                          WGP wavefront execution mode.
4357
4358                                                        See :ref:`amdgpu-amdhsa-memory-model`.
4359
4360                                                        Used by CP to set up
4361                                                        ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4362      30      1 bit   MEM_ORDERED                     GFX6-GFX9
4363                                                        Reserved, must be 0.
4364                                                      GFX10-GFX11
4365                                                        Controls the behavior of the
4366                                                        s_waitcnt's vmcnt and vscnt
4367                                                        counters.
4368
4369                                                        - If 0 vmcnt reports completion
4370                                                          of load and atomic with return
4371                                                          out of order with sample
4372                                                          instructions, and the vscnt
4373                                                          reports the completion of
4374                                                          store and atomic without
4375                                                          return in order.
4376                                                        - If 1 vmcnt reports completion
4377                                                          of load, atomic with return
4378                                                          and sample instructions in
4379                                                          order, and the vscnt reports
4380                                                          the completion of store and
4381                                                          atomic without return in order.
4382
4383                                                        Used by CP to set up
4384                                                        ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4385      31      1 bit   FWD_PROGRESS                    GFX6-GFX9
4386                                                        Reserved, must be 0.
4387                                                      GFX10-GFX11
4388                                                        - If 0 execute SIMD wavefronts
4389                                                          using oldest first policy.
4390                                                        - If 1 execute SIMD wavefronts to
4391                                                          ensure wavefronts will make some
4392                                                          forward progress.
4393
4394                                                        Used by CP to set up
4395                                                        ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4396      32      **Total size 4 bytes.**
4397      ======= ===================================================================================================================
4398
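As an illustration of the encodings in the table above, the following Python
sketch computes ``GRANULATED_WAVEFRONT_SGPR_COUNT`` from ``sgprs_used`` and
packs a few fields into a 32-bit ``compute_pgm_rsrc1`` word. It is not part of
LLVM: the helper names (``ceil_div``, ``granulated_sgpr_count``,
``pack_rsrc1``) and the ``gfx_major`` parameter are hypothetical, and only bit
positions listed in the table are used.

.. code-block:: python

   # Hypothetical helpers for illustration only; not an LLVM API.

   # (low bit, width) of some compute_pgm_rsrc1 fields, taken from the table.
   RSRC1_FIELDS = {
       "GRANULATED_WAVEFRONT_SGPR_COUNT": (6, 4),
       "FLOAT_ROUND_MODE_32": (12, 2),
       "FLOAT_ROUND_MODE_16_64": (14, 2),
       "FLOAT_DENORM_MODE_32": (16, 2),
       "FLOAT_DENORM_MODE_16_64": (18, 2),
       "ENABLE_DX10_CLAMP": (21, 1),
       "ENABLE_IEEE_MODE": (23, 1),
       "FP16_OVFL": (26, 1),
       "WGP_MODE": (29, 1),
       "MEM_ORDERED": (30, 1),
       "FWD_PROGRESS": (31, 1),
   }


   def ceil_div(a, b):
       """Integer ceiling division for non-negative a and positive b."""
       return -(-a // b)


   def granulated_sgpr_count(sgprs_used, gfx_major):
       """Encode sgprs_used for a major GFX version in 6..11."""
       if gfx_major not in range(6, 12):
           raise ValueError("this sketch only covers GFX6-GFX11")
       if gfx_major >= 10:
           return 0  # Reserved, must be 0 on GFX10-GFX11.
       if not 0 <= sgprs_used <= 112:
           raise ValueError("sgprs_used must be in 0..112")
       if gfx_major == 9:
           return 2 * max(0, ceil_div(sgprs_used, 16) - 1)
       return max(0, ceil_div(sgprs_used, 8) - 1)  # GFX6-GFX8


   def pack_rsrc1(fields):
       """Pack a {field name: value} dict into a 32-bit word."""
       word = 0
       for name, value in fields.items():
           low, width = RSRC1_FIELDS[name]
           if not 0 <= value < (1 << width):
               raise ValueError(f"{name} does not fit in {width} bit(s)")
           word |= value << low
       return word

For example, ``granulated_sgpr_count(17, 9)`` evaluates to 2, and
``pack_rsrc1({"GRANULATED_WAVEFRONT_SGPR_COUNT": 2, "ENABLE_IEEE_MODE": 1})``
evaluates to ``0x800080``. Fields that the table marks as "Must be 0" are
deliberately left out of ``RSRC1_FIELDS`` because CP fills them in.
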
4399 ..
4400
4401   .. table:: compute_pgm_rsrc2 for GFX6-GFX11
4402      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table
4403
4404      ======= ======= =============================== ===========================================================================
4405      Bits    Size    Field Name                      Description
4406      ======= ======= =============================== ===========================================================================
4407      0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4408                                                        private segment.
4409                                                      * If the *Target Properties*
4410                                                        column of
4411                                                        :ref:`amdgpu-processor-table`
4412                                                        does not specify
4413                                                        *Architected flat
4414                                                        scratch* then enable the
4415                                                        setup of the SGPR
4416                                                        wavefront scratch offset
4417                                                        system register (see
4418                                                        :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4419                                                      * If the *Target Properties*
4420                                                        column of
4421                                                        :ref:`amdgpu-processor-table`
4422                                                        specifies *Architected
4423                                                        flat scratch* then enable
4424                                                        the setup of the
4425                                                        FLAT_SCRATCH register
4426                                                        pair (see
4427                                                        :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4428
4429                                                      Used by CP to set up
4430                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4431      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4432                                                      user data
4433                                                      registers requested. This
4434                                                      number must be greater than
4435                                                      or equal to the number of user
4436                                                      data registers enabled.
4437
4438                                                      Used by CP to set up
4439                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4440      6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4441
4442                                                      This bit represents
4443                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4444                                                      which is set by the CP if
4445                                                      the runtime has installed a
4446                                                      trap handler.
4447      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4448                                                      system SGPR register for
4449                                                      the work-group id in the X
4450                                                      dimension (see
4451                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4452
4453                                                      Used by CP to set up
4454                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4455      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4456                                                      system SGPR register for
4457                                                      the work-group id in the Y
4458                                                      dimension (see
4459                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4460
4461                                                      Used by CP to set up
4462                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4463      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4464                                                      system SGPR register for
4465                                                      the work-group id in the Z
4466                                                      dimension (see
4467                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4468
4469                                                      Used by CP to set up
4470                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4471      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4472                                                      system SGPR register for
4473                                                      work-group information (see
4474                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4475
4476                                                      Used by CP to set up
4477                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4478      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4479                                                      VGPR system registers used
4480                                                      for the work-item ID.
4481                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4482                                                      defines the values.
4483
4484                                                      Used by CP to set up
4485                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4486      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4487
4488                                                      Wavefront starts execution
4489                                                      with address watch
4490                                                      exceptions enabled which
4491                                                      are generated when L1 has
4492                                                      witnessed a thread access
4493                                                      an *address of
4494                                                      interest*.
4495
4496                                                      CP is responsible for
4497                                                      filling in the address
4498                                                      watch bit in
4499                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4500                                                      according to what the
4501                                                      runtime requests.
4502      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4503
4504                                                      Wavefront starts execution
4505                                                      with memory violation
4506                                                      exceptions
4507                                                      enabled which are generated
4508                                                      when a memory violation has
4509                                                      occurred for this wavefront from
4510                                                      L1 or LDS
4511                                                      (write-to-read-only-memory,
4512                                                      mis-aligned atomic, LDS
4513                                                      address out of range,
4514                                                      illegal address, etc.).
4515
4516                                                      CP sets the memory
4517                                                      violation bit in
4518                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4519                                                      according to what the
4520                                                      runtime requests.
4521      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4522
4523                                                      CP uses the rounded value
4524                                                      from the dispatch packet,
4525                                                      not this value, as the
4526                                                      dispatch may contain
4527                                                      dynamically allocated group
4528                                                      segment memory. CP writes
4529                                                      directly to
4530                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4531
4532                                                      Amount of group segment
4533                                                      (LDS) to allocate for each
4534                                                      work-group. Granularity is
4535                                                      device specific:
4536
4537                                                      GFX6
4538                                                        roundup(lds-size / (64 * 4))
4539                                                      GFX7-GFX11
4540                                                        roundup(lds-size / (128 * 4))
4541
4542      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4543                      _INVALID_OPERATION              with specified exceptions
4544                                                      enabled.
4545
4546                                                      Used by CP to set up
4547                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
4548                                                      (set from bits 0..6).
4549
4550                                                      IEEE 754 FP Invalid
4551                                                      Operation
4552      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal (one or more
4553                      _SOURCE                         input operands is a
4554                                                      denormal number)
4555      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4556                      _DIVISION_BY_ZERO               Zero
4557      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4558                      _OVERFLOW
4559      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4560                      _UNDERFLOW
4561      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4562                      _INEXACT
4563      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4564                      _ZERO                           (rcp_iflag_f32 instruction
4565                                                      only)
4566      31      1 bit                                   Reserved, must be 0.
4567      32      **Total size 4 bytes.**
4568      ======= ===================================================================================================================
4569
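The bit layout in the table above can be exercised with a short Python sketch
that splits a 32-bit ``compute_pgm_rsrc2`` word into its named fields and
applies the ``GRANULATED_LDS_SIZE`` rounding. The helpers ``decode_rsrc2`` and
``granulated_lds_size`` and the ``gfx_major`` parameter are hypothetical names
used only here, and the sketch assumes that ``lds-size`` is expressed in bytes.

.. code-block:: python

   # Hypothetical helpers for illustration only; not an LLVM API.

   # (field name, low bit, width), taken from the table. Bit 31 is reserved.
   RSRC2_FIELDS = [
       ("ENABLE_PRIVATE_SEGMENT", 0, 1),
       ("USER_SGPR_COUNT", 1, 5),
       ("ENABLE_TRAP_HANDLER", 6, 1),
       ("ENABLE_SGPR_WORKGROUP_ID_X", 7, 1),
       ("ENABLE_SGPR_WORKGROUP_ID_Y", 8, 1),
       ("ENABLE_SGPR_WORKGROUP_ID_Z", 9, 1),
       ("ENABLE_SGPR_WORKGROUP_INFO", 10, 1),
       ("ENABLE_VGPR_WORKITEM_ID", 11, 2),
       ("ENABLE_EXCEPTION_ADDRESS_WATCH", 13, 1),
       ("ENABLE_EXCEPTION_MEMORY", 14, 1),
       ("GRANULATED_LDS_SIZE", 15, 9),
       ("ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION", 24, 1),
       ("ENABLE_EXCEPTION_FP_DENORMAL_SOURCE", 25, 1),
       ("ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO", 26, 1),
       ("ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW", 27, 1),
       ("ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW", 28, 1),
       ("ENABLE_EXCEPTION_IEEE_754_FP_INEXACT", 29, 1),
       ("ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO", 30, 1),
   ]


   def decode_rsrc2(word):
       """Return a {field name: value} dict for a 32-bit rsrc2 word."""
       return {
           name: (word >> low) & ((1 << width) - 1)
           for name, low, width in RSRC2_FIELDS
       }


   def granulated_lds_size(lds_size_bytes, gfx_major):
       """Round an LDS size up to granules (assumes a size in bytes)."""
       granule = 64 * 4 if gfx_major == 6 else 128 * 4  # GFX6 vs GFX7-GFX11
       return -(-lds_size_bytes // granule)  # integer ceiling division

With these definitions, ``granulated_lds_size(1000, 9)`` evaluates to 2 and
``granulated_lds_size(1000, 6)`` to 4. As the table states, the
``GRANULATED_LDS_SIZE`` field in the kernel descriptor itself must be 0; the
rounding applies to the value CP takes from the dispatch packet.
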
4570 ..
4571
4572   .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4573      :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4574
4575      ======= ======= =============================== ===========================================================================
4576      Bits    Size    Field Name                      Description
4577      ======= ======= =============================== ===========================================================================
4578      5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
4579                                                      Value 0-63: 0 means accum-offset = 4, 1 means accum-offset = 8, ...,
4580                                                      63 means accum-offset = 256.
4581      15:6    10                                      Reserved, must be 0.
4582              bits
4583      16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
4584                                                        launched in the same CU.
4585                                                      - If 1 the waves of a work-group can be
4586                                                        launched in different CUs. The waves
4587                                                        cannot use S_BARRIER or LDS.
4588      31:17   15                                      Reserved, must be 0.
4589              bits
4590      32      **Total size 4 bytes.**
4591      ======= ===================================================================================================================
4592
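The ``ACCUM_OFFSET`` encoding in the table above stores the offset divided by
the granularity of 4, minus one. A minimal Python sketch of that mapping
follows; ``encode_accum_offset`` and ``pack_rsrc3_gfx90a`` are hypothetical
helper names used only for illustration.

.. code-block:: python

   # Hypothetical helpers for illustration only; not an LLVM API.

   def encode_accum_offset(accum_offset):
       """Map accum-offset (4, 8, ..., 256) to the 6-bit field value 0..63."""
       if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
           raise ValueError("accum-offset must be a multiple of 4 in 4..256")
       return accum_offset // 4 - 1


   def pack_rsrc3_gfx90a(accum_offset, tg_split=False):
       """Pack ACCUM_OFFSET (bits 5:0) and TG_SPLIT (bit 16)."""
       return encode_accum_offset(accum_offset) | (int(tg_split) << 16)

For example, ``encode_accum_offset(256)`` evaluates to 63, and
``pack_rsrc3_gfx90a(8, tg_split=True)`` evaluates to ``0x10001``. All other
bits stay 0, matching the reserved ranges in the table.
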
4593 ..
4594
4595   .. table:: compute_pgm_rsrc3 for GFX10-GFX11
4596      :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
4597
4598      ======= ======= =============================== ===========================================================================
4599      Bits    Size    Field Name                      Description
4600      ======= ======= =============================== ===========================================================================
4601      3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
4602                                                      wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4603                                                      of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4604                                                      not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4605      9:4     6 bits  INST_PREF_SIZE                  GFX10
4606                                                        Reserved, must be 0.
4607                                                      GFX11
4608                                                        Number of instruction bytes to prefetch, starting at the kernel's entry
4609                                                        point instruction, before wavefront starts execution. The value is 0..63
4610                                                        with a granularity of 128 bytes.
4611      10      1 bit   TRAP_ON_START                   GFX10
4612                                                        Reserved, must be 0.
4613                                                      GFX11
4614                                                        Must be 0.
4615
4616                                                        If 1, wavefront starts execution by trapping into the trap handler.
4617
4618                                                        CP is responsible for filling in the trap on start bit in
4619                                                        ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
4620                                                        requests.
4621      11      1 bit   TRAP_ON_END                     GFX10
4622                                                        Reserved, must be 0.
4623                                                      GFX11
4624                                                        Must be 0.
4625
4626                                                        If 1, wavefront execution terminates by trapping into the trap handler.
4627
4628                                                        CP is responsible for filling in the trap on end bit in
4629                                                        ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
4630      30:12   19 bits                                 Reserved, must be 0.
4631      31      1 bit   IMAGE_OP                        GFX10
4632                                                        Reserved, must be 0.
4633                                                      GFX11
4634                                                        If 1, the kernel execution contains image instructions. If executed as
4635                                                        part of a graphics pipeline, image read instructions will stall waiting
4636                                                        for any necessary ``WAIT_SYNC`` fence to be performed in order to
4637                                                        indicate that earlier pipeline stages have completed writing to the
4638                                                        image.
4639
4640                                                        Not used for compute kernels that are not part of a graphics pipeline and
4641                                                        must be 0.
4642      32      **Total size 4 bytes.**
4643      ======= ===================================================================================================================
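
As an illustration only, the GFX10-GFX11 ``compute_pgm_rsrc3`` layout in the
table above could be decoded as in the following Python sketch. The function
names ``decode_compute_pgm_rsrc3`` and ``shared_vgprs_fit`` and the dictionary
keys are hypothetical and are not part of LLVM or any runtime API. The sketch
also assumes that the INST_PREF_SIZE value multiplied by 128 gives the prefetch
size in bytes.

.. code-block:: python

  def decode_compute_pgm_rsrc3(word):
      """Split a 32-bit compute_pgm_rsrc3 value into its GFX10-GFX11 fields."""
      if not 0 <= word <= 0xFFFFFFFF:
          raise ValueError("compute_pgm_rsrc3 is a 32-bit value")
      inst_pref_size = (word >> 4) & 0x3F
      return {
          "shared_vgpr_count": word & 0xF,          # bits 3:0
          "inst_pref_size": inst_pref_size,         # bits 9:4
          # Assumption: value * 128 bytes (granularity of 128 bytes).
          "inst_pref_bytes": inst_pref_size * 128,
          "trap_on_start": (word >> 10) & 0x1,      # bit 10
          "trap_on_end": (word >> 11) & 0x1,        # bit 11
          "reserved": (word >> 12) & 0x7FFFF,       # bits 30:12, must be 0
          "image_op": (word >> 31) & 0x1,           # bit 31
      }


  def shared_vgprs_fit(rsrc1_vgprs, shared_vgpr_count):
      """Check the wavefront size 64 limit stated for SHARED_VGPR_COUNT."""
      return (rsrc1_vgprs + 1) * 4 + shared_vgpr_count * 8 <= 256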
4644
4645 ..
4646
4647   .. table:: Floating Point Rounding Mode Enumeration Values
4648      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4649
4650      ====================================== ===== ==============================
4651      Enumeration Name                       Value Description
4652      ====================================== ===== ==============================
4653      FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4654      FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4655      FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4656      FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4657      ====================================== ===== ==============================
4658
4659 ..
4660
4661   .. table:: Floating Point Denorm Mode Enumeration Values
4662      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4663
4664      ====================================== ===== ====================================
4665      Enumeration Name                       Value Description
4666      ====================================== ===== ====================================
4667      FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination Denorms
4668      FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4669      FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4670      FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4671      ====================================== ===== ====================================
4672
4673   Denormal flushing is sign respecting, i.e., the behavior expected by
4674   ``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
4675   ``"denormal-fp-math"="positive-zero"``.
4676
4677 ..
4678
4679   .. table:: System VGPR Work-Item ID Enumeration Values
4680      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4681
4682      ======================================== ===== ============================
4683      Enumeration Name                         Value Description
4684      ======================================== ===== ============================
4685      SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4686                                                     ID.
4687      SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4688                                                     dimensions ID.
4689      SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4690                                                     dimensions ID.
4691      SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4692      ======================================== ===== ============================
4693
4694 .. _amdgpu-amdhsa-initial-kernel-execution-state:
4695
4696 Initial Kernel Execution State
4697 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4698
4699 This section defines the register state that will be set up by the packet
4700 processor prior to the start of execution of every wavefront. This is limited by
4701 the constraints of the hardware controllers of CP/ADC/SPI.
4702
4703 The order of the SGPR registers is defined, but the compiler can specify which
4704 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4705 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4706 for enabled registers are dense starting at SGPR0: the first enabled register is
4707 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4708 an SGPR number.
4709
4710 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4711 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4712 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4713 actually initialized. These are then immediately followed by the System SGPRs
4714 that are set up by ADC/SPI and can have different values for each wavefront of
4715 the grid dispatch.
4716
4717 SGPR register initial state is defined in
4718 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4719
4720   .. table:: SGPR Register Set Up Order
4721      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4722
4723      ========== ========================== ====== ==============================
4724      SGPR Order Name                       Number Description
4725                 (kernel descriptor enable  of
4726                 field)                     SGPRs
4727      ========== ========================== ====== ==============================
4728      First      Private Segment Buffer     4      See
4729                 (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4730                 _segment_buffer)
4731      then       Dispatch Ptr               2      64-bit address of AQL dispatch
4732                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4733                                                   actually executing.
4734      then       Queue Ptr                  2      64-bit address of amd_queue_t
4735                 (enable_sgpr_queue_ptr)           object for AQL queue on which
4736                                                   the dispatch packet was
4737                                                   queued.
4738      then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4739                 (enable_sgpr_kernarg              segment. This is directly
4740                 _segment_ptr)                     copied from the
4741                                                   kernarg_address in the kernel
4742                                                   dispatch packet.
4743
4744                                                   Having CP load it once avoids
4745                                                   loading it at the beginning of
4746                                                   every wavefront.
4747      then       Dispatch Id                2      64-bit Dispatch ID of the
4748                 (enable_sgpr_dispatch_id)         dispatch packet being
4749                                                   executed.
4750      then       Flat Scratch Init          2      See
4751                 (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4752                 _init)
4753      then       Private Segment Size       1      The 32-bit byte size of a
4754                 (enable_sgpr_private              single work-item's memory
4755                 _segment_size)                    allocation. This is the
4756                                                   value from the kernel
4757                                                   dispatch packet Private
4758                                                   Segment Byte Size rounded up
4759                                                   by CP to a multiple of
4760                                                   DWORD.
4761
4762                                                   Having CP load it once avoids
4763                                                   loading it at the beginning of
4764                                                   every wavefront.
4765
4766                                                   This is not used for
4767                                                   GFX7-GFX8 since it is the same
4768                                                   value as the second SGPR of
4769                                                   Flat Scratch Init. However, it
4770                                                   may be needed for GFX9-GFX11 which
4771                                                   changes the meaning of the
4772                                                   Flat Scratch Init value.
4773      then       Work-Group Id X            1      32-bit work-group id in X
4774                 (enable_sgpr_workgroup_id         dimension of grid for
4775                 _X)                               wavefront.
4776      then       Work-Group Id Y            1      32-bit work-group id in Y
4777                 (enable_sgpr_workgroup_id         dimension of grid for
4778                 _Y)                               wavefront.
4779      then       Work-Group Id Z            1      32-bit work-group id in Z
4780                 (enable_sgpr_workgroup_id         dimension of grid for
4781                 _Z)                               wavefront.
4782      then       Work-Group Info            1      {first_wavefront, 14'b0000,
4783                 (enable_sgpr_workgroup            ordered_append_term[10:0],
4784                 _info)                            threadgroup_size_in_wavefronts[5:0]}
4785      then       Scratch Wavefront Offset   1      See
4786                 (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4787                 _segment_wavefront_offset)        and
4788                                                   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4789      ========== ========================== ====== ==============================
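
The dense numbering described above can be illustrated with a short Python
sketch. It is a minimal model of the table only: the names
``SGPR_SETUP_ORDER`` and ``assign_sgprs`` are hypothetical, and the sketch does
not model the limit of 16 initialized User SGPRs.

.. code-block:: python

  # (kernel descriptor enable field, number of SGPRs) in set-up order.
  SGPR_SETUP_ORDER = [
      ("enable_sgpr_private_segment_buffer", 4),
      ("enable_sgpr_dispatch_ptr", 2),
      ("enable_sgpr_queue_ptr", 2),
      ("enable_sgpr_kernarg_segment_ptr", 2),
      ("enable_sgpr_dispatch_id", 2),
      ("enable_sgpr_flat_scratch_init", 2),
      ("enable_sgpr_private_segment_size", 1),
      ("enable_sgpr_workgroup_id_X", 1),
      ("enable_sgpr_workgroup_id_Y", 1),
      ("enable_sgpr_workgroup_id_Z", 1),
      ("enable_sgpr_workgroup_info", 1),
      ("enable_sgpr_private_segment_wavefront_offset", 1),
  ]


  def assign_sgprs(enabled):
      """Map each enabled field to its (first, last) SGPR number.

      Numbers are dense starting at SGPR0; disabled fields get no number.
      """
      layout = {}
      next_sgpr = 0
      for field, count in SGPR_SETUP_ORDER:
          if field in enabled:
              layout[field] = (next_sgpr, next_sgpr + count - 1)
              next_sgpr += count
      return layout

For example, enabling only the Private Segment Buffer, the Kernarg Segment Ptr
and Work-Group Id X places them in SGPR0-3, SGPR4-5 and SGPR6 respectively.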
4790
4791 The order of the VGPR registers is defined, but the compiler can specify which
4792 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4793 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4794 for enabled registers are dense starting at VGPR0: the first enabled register is
4795 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4796 VGPR number.
4797
4798 There are different methods used for the VGPR initial state:
4799
4800 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4801   specifies otherwise, a separate VGPR register is used per work-item ID. The
4802   VGPR register initial state for this method is defined in
4803   :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4804 * If *Target Properties* column of :ref:`amdgpu-processor-table`
4805   specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4806   for all work-item IDs. The register layout for this method is defined in
4807   :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4808
4809   .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4810      :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4811
4812      ========== ========================== ====== ==============================
4813      VGPR Order Name                       Number Description
4814                 (kernel descriptor enable  of
4815                 field)                     VGPRs
4816      ========== ========================== ====== ==============================
4817      First      Work-Item Id X             1      32-bit work-item id in X
4818                 (Always initialized)              dimension of work-group for
4819                                                   wavefront lane.
4820      then       Work-Item Id Y             1      32-bit work-item id in Y
4821                 (enable_vgpr_workitem_id          dimension of work-group for
4822                 > 0)                              wavefront lane.
4823      then       Work-Item Id Z             1      32-bit work-item id in Z
4824                 (enable_vgpr_workitem_id          dimension of work-group for
4825                 > 1)                              wavefront lane.
4826      ========== ========================== ====== ==============================
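
For the unpacked method, the mapping from ``enable_vgpr_workitem_id`` to
initialized VGPRs can be sketched as follows. The function name
``unpacked_work_item_id_vgprs`` is hypothetical; the value 3 is rejected because
the enumeration defines it as undefined.

.. code-block:: python

  def unpacked_work_item_id_vgprs(enable_vgpr_workitem_id):
      """Return which VGPR holds each initialized work-item ID."""
      if enable_vgpr_workitem_id not in (0, 1, 2):
          raise ValueError("enable_vgpr_workitem_id must be 0, 1 or 2")
      names = ["Work-Item Id X", "Work-Item Id Y", "Work-Item Id Z"]
      # X is always initialized; Y needs a value > 0 and Z a value > 1.
      enabled = names[: enable_vgpr_workitem_id + 1]
      return {name: f"VGPR{index}" for index, name in enumerate(enabled)}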
4827
4828 ..
4829
4830   .. table:: Register Layout for Packed Work-Item ID Method
4831      :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4832
4833      ======= ======= ================ =========================================
4834      Bits    Size    Field Name       Description
4835      ======= ======= ================ =========================================
4836      0:9     10 bits Work-Item Id X   Work-item id in X
4837                                       dimension of work-group for
4838                                       wavefront lane.
4839
4840                                       Always initialized.
4841
4842      10:19   10 bits Work-Item Id Y   Work-item id in Y
4843                                       dimension of work-group for
4844                                       wavefront lane.
4845
4846                                       Initialized if enable_vgpr_workitem_id >
4847                                       0, otherwise set to 0.
4848      20:29   10 bits Work-Item Id Z   Work-item id in Z
4849                                       dimension of work-group for
4850                                       wavefront lane.
4851
4852                                       Initialized if enable_vgpr_workitem_id >
4853                                       1, otherwise set to 0.
4854      30:31   2 bits                   Reserved, set to 0.
4855      ======= ======= ================ =========================================
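
As a sketch of the packed layout above, the three IDs can be extracted from the
initial VGPR0 value with shifts and a 10-bit mask. The function name
``unpack_work_item_ids`` is hypothetical and only illustrates the bit positions
in the table.

.. code-block:: python

  def unpack_work_item_ids(vgpr0):
      """Return (x, y, z) work-item IDs from a packed 32-bit VGPR0 value."""
      x = vgpr0 & 0x3FF            # bits 0:9
      y = (vgpr0 >> 10) & 0x3FF    # bits 10:19, 0 if not enabled
      z = (vgpr0 >> 20) & 0x3FF    # bits 20:29, 0 if not enabled
      return x, y, z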
4856
4857 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4858
4859 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4860    registers.
4861 2. Work-group Id registers X, Y, Z are set by ADC which supports any
4862    combination including none.
4863 3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is
4864    why its value cannot be included with the Flat Scratch Init value, which is
4865    per queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4866 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4867    or (X, Y, Z).
4868 5. Flat Scratch register pair initialization is described in
4869    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4870
4871 The global segment can be accessed either using buffer instructions (GFX6 which
4872 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
4873 instructions (GFX9-GFX11).
4874
4875 If buffer operations are used, then the compiler can generate a V# with the
4876 following properties:
4877
4878 * base address of 0
4879 * no swizzle
4880 * ATC: 1 if IOMMU present (such as APU)
4881 * ptr64: 1
4882 * MTYPE set to support memory coherence that matches the runtime (such as CC for
4883   APU and NC for dGPU).
4884
4885 .. _amdgpu-amdhsa-kernel-prolog:
4886
4887 Kernel Prolog
4888 ~~~~~~~~~~~~~
4889
4890 The compiler performs initialization in the kernel prolog depending on the
4891 target and on information such as stack usage in the kernel and called
4892 functions. Some of this initialization requires the compiler to request certain
4893 User and System SGPRs be present in the
4894 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4895 :ref:`amdgpu-amdhsa-kernel-descriptor`.
4896
4897 .. _amdgpu-amdhsa-kernel-prolog-cfi:
4898
4899 CFI
4900 +++
4901
4902 1.  The CFI return address is undefined.
4903
4904 2.  The CFI CFA is defined using an expression which evaluates to a location
4905     description that comprises one memory location description for the
4906     ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4907
4908 .. _amdgpu-amdhsa-kernel-prolog-m0:
4909
4910 M0
4911 ++
4912
4913 GFX6-GFX8
4914   The M0 register must be initialized with a value of at least the total LDS
4915   size if the kernel may access LDS via DS or flat operations. The total LDS
4916   size is available in the dispatch packet. Alternatively, M0 can be set to the
4917   maximum possible LDS size for the given target (0x7FFF for GFX6 and 0xFFFF
4918   for GFX7-GFX8).
4919 GFX9-GFX11
4920   The M0 register is not used for range checking LDS accesses and so does not
4921   need to be initialized in the prolog.
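
The simplest choice described above, using the maximum value for the target,
can be sketched as follows. The function name ``m0_lds_init_value`` and the
integer ``gfx_major`` parameter are hypothetical and serve only to illustrate
the rule.

.. code-block:: python

  def m0_lds_init_value(gfx_major):
      """Return the maximum M0 value for LDS range checking, or None.

      None means the prolog does not need to initialize M0 for LDS accesses.
      """
      if gfx_major == 6:
          return 0x7FFF
      if gfx_major in (7, 8):
          return 0xFFFF
      if 9 <= gfx_major <= 11:
          return None  # M0 is not used for range checking LDS accesses.
      raise ValueError("only GFX6-GFX11 are covered by this sketch")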
4922
4923 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4924
4925 Stack Pointer
4926 +++++++++++++
4927
4928 If the kernel has function calls it must set up the ABI stack pointer described
4929 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4930 SGPR32 to the unswizzled scratch offset of the address past the last local
4931 allocation.
4932
4933 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4934
4935 Frame Pointer
4936 +++++++++++++
4937
4938 If the kernel needs a frame pointer for the reasons defined in
4939 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4940 kernel prolog. If a frame pointer is not required then all uses of the frame
4941 pointer are replaced with immediate ``0`` offsets.
4942
4943 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4944
4945 Flat Scratch
4946 ++++++++++++
4947
4948 There are different methods used for initializing flat scratch:
4949
4950 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4951   specifies *Does not support generic address space*:
4952
4953   Flat scratch is not supported and there is no flat scratch register pair.
4954
4955 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4956   specifies *Offset flat scratch*:
4957
4958   If the kernel or any function it calls may use flat operations to access
4959   scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4960   (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4961   Scratch Wavefront Offset SGPR registers (see
4962   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4963
4964   1. The low word of Flat Scratch Init is the 32-bit byte offset from
4965      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4966      being managed by SPI for the queue executing the kernel dispatch. This is
4967      the same value used in the Scratch Segment Buffer V# base address.
4968
4969      CP obtains this from the runtime. (The Scratch Segment Buffer base address
4970      is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4971
4972      The prolog must add the value of Scratch Wavefront Offset to get the
4973      wavefront's byte scratch backing memory offset from
4974      ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4975
4976      The Scratch Wavefront Offset must also be used as an offset with Private
4977      segment address when using the Scratch Segment Buffer.
4978
4979      Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
4980      shifted by 8 before moving into FLAT_SCRATCH_HI.
4981
4982      FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4983      SGPRn is the highest numbered SGPR allocated to the wavefront).
4984      FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4985      added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4986      FLAT SCRATCH BASE in flat memory instructions that access the scratch
4987      aperture.
4988   2. The second word of Flat Scratch Init is the 32-bit byte size of a single
4989      work-item's scratch memory usage.
4990
4991      CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4992      checks that the value in the kernel dispatch packet Private Segment Byte
4993      Size is not larger and requests the runtime to increase the queue's scratch
4994      size if necessary.
4995
4996      CP directly loads from the kernel dispatch packet Private Segment Byte Size
4997      field and rounds up to a multiple of DWORD. Having CP load it once avoids
4998      loading it at the beginning of every wavefront.
4999
5000      The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
5001      GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
5002      in flat memory instructions.
5003
5004 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5005   specifies *Absolute flat scratch*:
5006
5007   If the kernel or any function it calls may use flat operations to access
5008   scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5009   (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
5010   uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
5011   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5012
5013   The Flat Scratch Init is the 64-bit address of the base of scratch backing
5014   memory being managed by SPI for the queue executing the kernel dispatch.
5015
5016   CP obtains this from the runtime.
5017
5018   The kernel prolog must add the value of the wave's Scratch Wavefront Offset
5019   and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
5020   which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
5021   memory instructions.
5022
5023   The Scratch Wavefront Offset must also be used as an offset with Private
5024   segment address when using the Scratch Segment Buffer (see
5025   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
5026
5027 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5028   specifies *Architected flat scratch*:
5029
5030   If ENABLE_PRIVATE_SEGMENT is enabled in
5031   :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH
5032   register pair will be initialized to the 64-bit address of the base of scratch
5033   backing memory being managed by SPI for the queue executing the kernel
5034   dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
5035   flat scratch base in flat memory instructions.
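
The arithmetic performed by the prolog for the *Offset flat scratch* and
*Absolute flat scratch* methods can be sketched in Python as follows. This is a
model of the calculations described above, not the code the compiler emits. The
function and parameter names are hypothetical, and the sketch assumes that
FLAT_SCRATCH_LO and FLAT_SCRATCH_HI hold the low and high 32 bits of the 64-bit
value in the absolute method.

.. code-block:: python

  MASK32 = 0xFFFFFFFF
  MASK64 = 0xFFFFFFFFFFFFFFFF


  def offset_flat_scratch(init_lo, init_hi, wave_offset):
      """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the offset method.

      init_lo: low word of Flat Scratch Init (queue scratch byte offset).
      init_hi: second word of Flat Scratch Init (per work-item byte size).
      wave_offset: Scratch Wavefront Offset in bytes.
      """
      wave_byte_offset = (init_lo + wave_offset) & MASK32
      flat_scratch_hi = wave_byte_offset >> 8   # units of 256 bytes
      flat_scratch_lo = init_hi                 # used as FLAT SCRATCH SIZE
      return flat_scratch_lo, flat_scratch_hi


  def absolute_flat_scratch(init_addr, wave_offset):
      """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the absolute method.

      init_addr: 64-bit Flat Scratch Init address.
      wave_offset: Scratch Wavefront Offset in bytes.
      """
      base = (init_addr + wave_offset) & MASK64
      return base & MASK32, base >> 32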
5036
5037 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
5038
5039 Private Segment Buffer
5040 ++++++++++++++++++++++
5041
5042 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
5043 *Architected flat scratch* then a Private Segment Buffer is not supported.
5044 Instead the flat SCRATCH instructions are used.
5045
5046 Otherwise, the Private Segment Buffer SGPR register is used to initialize 4
5047 SGPRs that are used as a V# to access scratch. CP uses the value provided by
5048 the runtime. It is used, together with Scratch Wavefront Offset as an offset,
5049 to access the private memory space using a segment address. See
5050 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
5051
5052 The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
5053 always selected for the kernel as follows:
5054
5055   - If it is known during instruction selection that there is stack usage,
5056     SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
5057     optimizations are disabled (``-O0``), if stack objects already exist (for
5058     locals, etc.), or if there are any function calls.
5059
5060   - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5061     are reserved for the tentative scratch V#. These will be used if it is
5062     determined that spilling is needed.
5063
5064     - If no use is made of the tentative scratch V#, then it is unreserved,
5065       and the register count is determined ignoring it.
5066     - If use is made of the tentative scratch V#, then its register numbers
5067       are shifted to the first four-aligned SGPR index after the highest one
5068       allocated by the register allocator, and all uses are updated. The
5069       register count includes them in the shifted location.
5070     - In either case, if the processor has the SGPR allocation bug, the
5071       tentative allocation is not shifted or unreserved in order to ensure
5072       the register count is higher to work around the bug.
5073
5074     .. note::
5075
5076       This approach of using a tentative scratch V# and shifting the register
5077       numbers if used avoids having to perform register allocation a second
5078       time if the tentative V# is eliminated. This is more efficient and
5079       avoids the problem that the second register allocation may perform
5080       spilling which will fail as there is no longer a scratch V#.
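
The shift of the tentative scratch V# to the first four-aligned SGPR index
after the highest allocated SGPR is a round-up to a multiple of four. A minimal
sketch, using the hypothetical function name ``shifted_scratch_vsharp``:

.. code-block:: python

  def shifted_scratch_vsharp(highest_allocated_sgpr):
      """Return the four SGPR indices used by the shifted scratch V#."""
      # First index after the highest allocated SGPR, rounded up to a
      # multiple of 4.
      first = (highest_allocated_sgpr + 1 + 3) & ~3
      return list(range(first, first + 4))

For example, if the register allocator's highest SGPR is SGPR9, the scratch V#
moves to SGPR12-15.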
5081
5082 When the kernel prolog code is being emitted it is known whether the scratch V#
5083 described above is actually used. If it is, the prolog code must set it up by
5084 copying the Private Segment Buffer to the scratch V# registers and then adding
5085 the Private Segment Wavefront Offset to the queue base address in the V#. The
5086 result is a V# with a base address pointing to the beginning of the wavefront
5087 scratch backing memory.
5088
5089 The Private Segment Buffer is always requested, but the Private Segment
5090 Wavefront Offset is only requested if it is used (see
5091 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5092
5093 .. _amdgpu-amdhsa-memory-model:
5094
5095 Memory Model
5096 ~~~~~~~~~~~~
5097
5098 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5099 code (see :ref:`memmodel`).
5100
5101 The AMDGPU backend supports the memory synchronization scopes specified in
5102 :ref:`amdgpu-memory-scopes`.
5103
5104 The code sequences used to implement the memory model specify the order of
5105 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5106 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5107 to other memory instructions executed by the same thread. This allows them to be
5108 moved earlier or later which can allow them to be combined with other instances
5109 of the same instruction, or hoisted/sunk out of loops to improve performance.
5110 Only the instructions related to the memory model are given; additional
5111 ``s_waitcnt`` instructions are required to ensure registers are defined before
5112 being used. These may be able to be combined with the memory model ``s_waitcnt``
5113 instructions as described above.
5114
5115 The AMDGPU backend supports the following memory models:
5116
5117   HSA Memory Model [HSA]_
5118     The HSA memory model uses a single happens-before relation for all address
5119     spaces (see :ref:`amdgpu-address-spaces`).
5120   OpenCL Memory Model [OpenCL]_
5121     The OpenCL memory model has separate happens-before relations for the
5122     global and local address spaces. Only a fence specifying both the global
5123     and local address spaces, and seq_cst instructions, join the relationships.
5124     Since the LLVM ``fence`` instruction does not allow an address space to be
5125     specified, the OpenCL fence has to conservatively assume both local and
5126     global address spaces were specified. However, optimizations can often be
5127     done to eliminate the additional ``s_waitcnt`` instructions when there are
5128     no intervening memory instructions which access the corresponding address
5129     space. The code sequences in the table indicate what can be omitted for the
5130     OpenCL memory model. The target triple environment is used to determine if
5131     the source language is OpenCL (see :ref:`amdgpu-opencl`).
5132
5133 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
5134 operations.
5135
5136 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5137 termed vector memory operations.
5138
5139 Private address space uses ``buffer_load/store`` using the scratch V#
5140 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5141 is accessing the memory, atomic memory orderings are not meaningful, and all
5142 accesses are treated as non-atomic.
5143
5144 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5145 scalar memory instructions). Since the constant address space contents do not
5146 change during the execution of a kernel dispatch it is not legal to perform
5147 stores, and atomic memory orderings are not meaningful, and all accesses are
5148 treated as non-atomic.
5149
5150 A memory synchronization scope wider than work-group is not meaningful for the
5151 group (LDS) address space and is treated as work-group.
5152
5153 The memory model does not support the region address space which is treated as
5154 non-atomic.
5155
5156 Acquire memory ordering is not meaningful on store atomic instructions and is
5157 treated as non-atomic.
5158
5159 Release memory ordering is not meaningful on load atomic instructions and is
5160 treated as non-atomic.
5161
5162 Acquire-release memory ordering is not meaningful on load or store atomic
5163 instructions and is treated as acquire and release respectively.
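
The three rules above about orderings that are not meaningful for a given
instruction can be summarized in a small Python sketch. The function name
``effective_ordering`` and the string values are hypothetical; the sketch only
restates the rules and does not cover ``atomicrmw`` or ``fence``.

.. code-block:: python

  def effective_ordering(instruction, ordering):
      """Return how a load or store atomic with the given ordering is treated.

      instruction: "load" or "store" (atomic).
      ordering: an LLVM memory ordering name such as "acquire" or "acq_rel".
      """
      if instruction not in ("load", "store"):
          raise ValueError("only load and store atomics are covered")
      if instruction == "store" and ordering == "acquire":
          return "non-atomic"
      if instruction == "load" and ordering == "release":
          return "non-atomic"
      if ordering == "acq_rel":
          # Acquire-release degrades to the half that is meaningful.
          return "acquire" if instruction == "load" else "release"
      return ordering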
5164
5165 The memory order also adds the single thread optimization constraints defined in
5166 table
5167 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5168
5169   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5170      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5171
5172      ============ ==============================================================
5173      LLVM Memory  Optimization Constraints
5174      Ordering
5175      ============ ==============================================================
5176      unordered    *none*
5177      monotonic    *none*
5178      acquire      - If a load atomic/atomicrmw then no following load/load
5179                     atomic/store/store atomic/atomicrmw/fence instruction can be
5180                     moved before the acquire.
5181                   - If a fence then same as load atomic, plus no preceding
5182                     associated fence-paired-atomic can be moved after the fence.
5183      release      - If a store atomic/atomicrmw then no preceding load/load
5184                     atomic/store/store atomic/atomicrmw/fence instruction can be
5185                     moved after the release.
5186                   - If a fence then same as store atomic, plus no following
5187                     associated fence-paired-atomic can be moved before the
5188                     fence.
5189      acq_rel      Same constraints as both acquire and release.
5190      seq_cst      - If a load atomic then same constraints as acquire, plus no
5191                     preceding sequentially consistent load atomic/store
5192                     atomic/atomicrmw/fence instruction can be moved after the
5193                     seq_cst.
5194                   - If a store atomic then the same constraints as release, plus
5195                     no following sequentially consistent load atomic/store
5196                     atomic/atomicrmw/fence instruction can be moved before the
5197                     seq_cst.
5198                   - If an atomicrmw/fence then same constraints as acq_rel.
5199      ============ ==============================================================
5200
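The table above can be approximated by the following Python sketch, which
reports the directions in which ordinary memory operations may not be moved
across an atomic instruction. It is a deliberate simplification: the extra
constraints that apply only between sequentially consistent operations, and the
fence-paired-atomic rules for fences, are not modeled. The name
``blocked_moves`` and the labels ``"hoist"`` and ``"sink"`` are hypothetical.

```python
def blocked_moves(instr, ordering):
    """Return the set of blocked moves across an atomic instruction.

    "hoist": a following memory operation may not move before the atomic.
    "sink":  a preceding memory operation may not move after the atomic.
    instr is "load", "store", "atomicrmw", or "fence".
    """
    if ordering in ("unordered", "monotonic"):
        return set()
    if ordering == "acquire":
        return {"hoist"}
    if ordering == "release":
        return {"sink"}
    if ordering == "acq_rel":
        return {"hoist", "sink"}
    if ordering == "seq_cst":
        if instr == "load":
            return {"hoist"}          # acquire constraints
        if instr == "store":
            return {"sink"}           # release constraints
        return {"hoist", "sink"}      # atomicrmw/fence: same as acq_rel
    raise ValueError("unknown ordering: " + ordering)
```

A scheduler-style check would then refuse to hoist a later load above an
``acquire`` atomic because ``"hoist"`` is in the returned set.
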
5201 The code sequences used to implement the memory model are defined in the
5202 following sections:
5203
5204 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5205 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5206 * :ref:`amdgpu-amdhsa-memory-model-gfx940`
5207 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5208
5209 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5210
5211 Memory Model GFX6-GFX9
5212 ++++++++++++++++++++++
5213
5214 For GFX6-GFX9:
5215
5216 * Each agent has multiple shader arrays (SA).
5217 * Each SA has multiple compute units (CU).
5218 * Each CU has multiple SIMDs that execute wavefronts.
5219 * The wavefronts for a single work-group are executed in the same CU but may be
5220   executed by different SIMDs.
5221 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
5222   executing on it.
5223 * All LDS operations of a CU are performed as wavefront wide operations in a
5224   global order and involve no caching. Completion is reported to a wavefront in
5225   execution order.
5226 * The LDS memory has multiple request queues shared by the SIMDs of a
5227   CU. Therefore, the LDS operations performed by different wavefronts of a
5228   work-group can be reordered relative to each other, which can result in
5229   reordering the visibility of vector memory operations with respect to LDS
5230   operations of other wavefronts in the same work-group. A ``s_waitcnt
5231   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5232   vector memory operations between wavefronts of a work-group, but not between
5233   operations performed by the same wavefront.
5234 * The vector memory operations are performed as wavefront wide operations and
5235   completion is reported to a wavefront in execution order. The exception is
5236   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5237   vector memory order if they access LDS memory, and out of LDS operation order
5238   if they access global memory.
5239 * The vector memory operations access a single vector L1 cache shared by all
5240   SIMDs of a CU. Therefore, no special action is required for coherence between the
5241   lanes of a single wavefront, or for coherence between wavefronts in the same
5242   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
5243   wavefronts executing in different work-groups as they may be executing on
5244   different CUs.
5245 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5246   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5247   scalar operations are used in a restricted way so do not impact the memory
5248   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5249 * The vector and scalar memory operations use an L2 cache shared by all CUs on
5250   the same agent.
5251 * The L2 cache has independent channels to service disjoint ranges of virtual
5252   addresses.
5253 * Each CU has a separate request queue per channel. Therefore, the vector and
5254   scalar memory operations performed by wavefronts executing in different
5255   work-groups (which may be executing on different CUs) of an agent can be
5256   reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5257   ensure synchronization between vector memory operations of different CUs. It
5258   ensures a previous vector memory operation has completed before executing a
5259   subsequent vector memory or LDS operation and so can be used to meet the
5260   requirements of acquire and release.
5261 * The L2 cache can be kept coherent with other agents on some targets, or ranges
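The synchronization requirements stated in the list above can be collected into
a lookup, sketched here in Python. The keys and the function name
``required_sync`` are hypothetical labels chosen for this example; only the
instruction strings come from the text, and cases the list does not address
are deliberately left out.

```python
# Keys are (concern, relation); values are the instruction named in the notes,
# or None where the notes say no special action is required.
_REQUIRED_SYNC = {
    ("lds_vs_vector", "same_wavefront"): None,
    ("lds_vs_vector", "same_workgroup"): "s_waitcnt lgkmcnt(0)",
    ("l1_coherence", "same_wavefront"): None,
    ("l1_coherence", "same_workgroup"): None,
    ("l1_coherence", "different_workgroup"): "buffer_wbinvl1_vol",
    ("vector_ordering", "different_workgroup"): "s_waitcnt vmcnt(0)",
}


def required_sync(concern, relation):
    """Look up the synchronization for a case; KeyError if not covered."""
    return _REQUIRED_SYNC[(concern, relation)]
```

For instance, ``required_sync("l1_coherence", "different_workgroup")`` returns
``"buffer_wbinvl1_vol"``, matching the vector L1 cache item in the list.
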
5262   of virtual addresses can be set up to bypass it to ensure system coherence.
5263
5264 Scalar memory operations are only used to access memory that is proven to not
5265 change during the execution of the kernel dispatch. This includes constant
5266 address space and global address space for program scope ``const`` variables.
5267 Therefore, the kernel machine code does not have to maintain the scalar cache to
5268 ensure it is coherent with the vector caches. The scalar and vector caches are
5269 invalidated between kernel dispatches by CP since constant address space data
5270 may change between kernel dispatch executions. See
5271 :ref:`amdgpu-amdhsa-memory-spaces`.
5272
5273 The one exception is if scalar writes are used to spill SGPR registers. In this
5274 case the AMDGPU backend ensures the memory location used to spill is never
5275 accessed by vector memory operations at the same time. If scalar writes are used
5276 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5277 return since the locations may be used for vector memory instructions by a
5278 future wavefront that uses the same scratch area, or a function call that
5279 creates a frame at the same address, respectively. There is no need for a
5280 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5281
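The placement rule for ``s_dcache_wb`` can be sketched as a pass over a list of
assembly lines. This Python example is an illustration only, not the backend's
implementation. It assumes that scalar writes can be recognized by an
``s_store`` mnemonic prefix and that a function return appears as
``s_setpc_b64``; both are simplifying assumptions, and the name
``insert_dcache_wb`` is hypothetical.

```python
def insert_dcache_wb(instrs, exits=("s_endpgm", "s_setpc_b64")):
    """Return a copy of instrs with s_dcache_wb placed before each exit.

    instrs: list of non-empty assembly lines, mnemonic first.
    exits:  mnemonics treated as program end or function return (assumed).
    Nothing is inserted unless some scalar store (s_store*) is present.
    """
    mnemonics = [line.split()[0] for line in instrs]
    if not any(m.startswith("s_store") for m in mnemonics):
        return list(instrs)
    out = []
    for line, mnemonic in zip(instrs, mnemonics):
        if mnemonic in exits:
            out.append("s_dcache_wb")  # write back before leaving
        out.append(line)
    return out
```

No ``s_dcache_inv`` is emitted, in line with the text: every scalar write is
write-before-read within the same thread.
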
5282 For kernarg backing memory:
5283
5284 * CP invalidates the L1 cache at the start of each kernel dispatch.
5285 * On dGPU the kernarg backing memory is allocated in host memory accessed as
5286   MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5287   causes it to be treated as non-volatile and so is not invalidated by
5288   ``*_vol``.
5289 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
5290   and so the L2 cache will be coherent with the CPU and other agents.
5291
5292 Scratch backing memory (which is used for the private address space) is accessed
5293 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5294 only accessed by a single thread, and is always write-before-read, there is
5295 never a need to invalidate these entries from the L1 cache. Hence all cache
5296 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5297
5298 The code sequences used to implement the memory model for GFX6-GFX9 are defined
5299 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5300
5301   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5302      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5303
5304      ============ ============ ============== ========== ================================
5305      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
5306                   Ordering     Sync Scope     Address    GFX6-GFX9
5307                                               Space
5308      ============ ============ ============== ========== ================================
5309      **Non-Atomic**
5310      ------------------------------------------------------------------------------------
5311      load         *none*       *none*         - global   - !volatile & !nontemporal
5312                                               - generic
5313                                               - private    1. buffer/global/flat_load
5314                                               - constant
5315                                                          - !volatile & nontemporal
5316
5317                                                            1. buffer/global/flat_load
5318                                                               glc=1 slc=1
5319
5320                                                          - volatile
5321
5322                                                            1. buffer/global/flat_load
5323                                                               glc=1
5324                                                            2. s_waitcnt vmcnt(0)
5325
5326                                                             - Must happen before
5327                                                               any following volatile
5328                                                               global/generic
5329                                                               load/store.
5330                                                             - Ensures that
5331                                                               volatile
5332                                                               operations to
5333                                                               different
5334                                                               addresses will not
5335                                                               be reordered by
5336                                                               hardware.
5337
5338      load         *none*       *none*         - local    1. ds_load
5339      store        *none*       *none*         - global   - !volatile & !nontemporal
5340                                               - generic
5341                                               - private    1. buffer/global/flat_store
5342                                               - constant
5343                                                          - !volatile & nontemporal
5344
5345                                                            1. buffer/global/flat_store
5346                                                               glc=1 slc=1
5347
5348                                                          - volatile
5349
5350                                                            1. buffer/global/flat_store
5351                                                            2. s_waitcnt vmcnt(0)
5352
5353                                                             - Must happen before
5354                                                               any following volatile
5355                                                               global/generic
5356                                                               load/store.
5357                                                             - Ensures that
5358                                                               volatile
5359                                                               operations to
5360                                                               different
5361                                                               addresses will not
5362                                                               be reordered by
5363                                                               hardware.
5364
5365      store        *none*       *none*         - local    1. ds_store
5366      **Unordered Atomic**
5367      ------------------------------------------------------------------------------------
5368      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
5369      store atomic unordered    *any*          *any*      *Same as non-atomic*.
5370      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
5371      **Monotonic Atomic**
5372      ------------------------------------------------------------------------------------
5373      load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
5374                                - wavefront    - local
5375                                - workgroup    - generic
5376      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
5377                                - system       - generic     glc=1
5378      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
5379                                - wavefront    - generic
5380                                - workgroup
5381                                - agent
5382                                - system
5383      store atomic monotonic    - singlethread - local    1. ds_store
5384                                - wavefront
5385                                - workgroup
5386      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
5387                                - wavefront    - generic
5388                                - workgroup
5389                                - agent
5390                                - system
5391      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
5392                                - wavefront
5393                                - workgroup
5394      **Acquire Atomic**
5395      ------------------------------------------------------------------------------------
5396      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
5397                                - wavefront    - local
5398                                               - generic
5399      load atomic  acquire      - workgroup    - global   1. buffer/global_load
5400      load atomic  acquire      - workgroup    - local    1. ds/flat_load
5401                                               - generic  2. s_waitcnt lgkmcnt(0)
5402
5403                                                            - If OpenCL, omit.
5404                                                            - Must happen before
5405                                                              any following
5406                                                              global/generic
5407                                                              load/load
5408                                                              atomic/store/store
5409                                                              atomic/atomicrmw.
5410                                                            - Ensures any
5411                                                              following global
5412                                                              data read is no
5413                                                              older than a local load
5414                                                              atomic value being
5415                                                              acquired.
5416
5417      load atomic  acquire      - agent        - global   1. buffer/global_load
5418                                - system                     glc=1
5419                                                          2. s_waitcnt vmcnt(0)
5420
5421                                                            - Must happen before
5422                                                              following
5423                                                              buffer_wbinvl1_vol.
5424                                                            - Ensures the load
5425                                                              has completed
5426                                                              before invalidating
5427                                                              the cache.
5428
5429                                                          3. buffer_wbinvl1_vol
5430
5431                                                            - Must happen before
5432                                                              any following
5433                                                              global/generic
5434                                                              load/load
5435                                                              atomic/atomicrmw.
5436                                                            - Ensures that
5437                                                              following
5438                                                              loads will not see
5439                                                              stale global data.
5440
5441      load atomic  acquire      - agent        - generic  1. flat_load glc=1
5442                                - system                  2. s_waitcnt vmcnt(0) &
5443                                                            - If OpenCL, omit
5444
5445                                                            - If OpenCL omit
5446                                                              lgkmcnt(0).
5447                                                            - Must happen before
5448                                                              following
5449                                                              buffer_wbinvl1_vol.
5450                                                            - Ensures the flat_load
5451                                                              has completed
5452                                                              before invalidating
5453                                                              the cache.
5454
5455                                                          3. buffer_wbinvl1_vol
5456
5457                                                            - Must happen before
5458                                                              any following
5459                                                              global/generic
5460                                                              load/load
5461                                                              atomic/atomicrmw.
5462                                                            - Ensures that
5463                                                              following loads
5464                                                              will not see stale
5465                                                              global data.
5466
5467      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5468                                - wavefront    - local
5469                                               - generic
5470      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5471      atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5472                                               - generic  2. s_waitcnt lgkmcnt(0)
5473
5474                                                            - If OpenCL, omit.
5475                                                            - Must happen before
5476                                                              any following
5477                                                              global/generic
5478                                                              load/load
5479                                                              atomic/store/store
5480                                                              atomic/atomicrmw.
5481                                                            - Ensures any
5482                                                              following global
5483                                                              data read is no
5484                                                              older than a local
5485                                                              atomicrmw value
5486                                                              being acquired.
5487
5488      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5489                                - system                  2. s_waitcnt vmcnt(0)
5490
5491                                                            - Must happen before
5492                                                              following
5493                                                              buffer_wbinvl1_vol.
5494                                                            - Ensures the
5495                                                              atomicrmw has
5496                                                              completed before
5497                                                              invalidating the
5498                                                              cache.
5499
5500                                                          3. buffer_wbinvl1_vol
5501
5502                                                            - Must happen before
5503                                                              any following
5504                                                              global/generic
5505                                                              load/load
5506                                                              atomic/atomicrmw.
5507                                                            - Ensures that
5508                                                              following loads
5509                                                              will not see stale
5510                                                              global data.
5511
5512      atomicrmw    acquire      - agent        - generic  1. flat_atomic
5513                                - system                  2. s_waitcnt vmcnt(0) &
5514                                                             lgkmcnt(0)
5515
5516                                                            - If OpenCL, omit
5517                                                              lgkmcnt(0).
5518                                                            - Must happen before
5519                                                              following
5520                                                              buffer_wbinvl1_vol.
5521                                                            - Ensures the
5522                                                              atomicrmw has
5523                                                              completed before
5524                                                              invalidating the
5525                                                              cache.
5526
5527                                                          3. buffer_wbinvl1_vol
5528
5529                                                            - Must happen before
5530                                                              any following
5531                                                              global/generic
5532                                                              load/load
5533                                                              atomic/atomicrmw.
5534                                                            - Ensures that
5535                                                              following loads
5536                                                              will not see stale
5537                                                              global data.
5538
5539      fence        acquire      - singlethread *none*     *none*
5540                                - wavefront
5541      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5542
5543                                                            - If OpenCL and
5544                                                              address space is
5545                                                              not generic, omit.
5546                                                            - However, since LLVM
5547                                                              currently has no
5548                                                              address space on
5549                                                              the fence, it must
5550                                                              conservatively
5551                                                              always be generated. If
5552                                                              fence had an
5553                                                              address space then
5554                                                              set to address
5555                                                              space of OpenCL
5556                                                              fence flag, or to
5557                                                              generic if both
5558                                                              local and global
5559                                                              flags are
5560                                                              specified.
5561                                                            - Must happen after
5562                                                              any preceding
5563                                                              local/generic load
5564                                                              atomic/atomicrmw
5565                                                              with an equal or
5566                                                              wider sync scope
5567                                                              and memory ordering
5568                                                              stronger than
5569                                                              unordered (this is
5570                                                              termed the
5571                                                              fence-paired-atomic).
5572                                                            - Must happen before
5573                                                              any following
5574                                                              global/generic
5575                                                              load/load
5576                                                              atomic/store/store
5577                                                              atomic/atomicrmw.
5578                                                            - Ensures any
5579                                                              following global
5580                                                              data read is no
5581                                                              older than the
5582                                                              value read by the
5583                                                              fence-paired-atomic.
5584
5585      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5586                                - system                     vmcnt(0)
5587
5588                                                            - If OpenCL and
5589                                                              address space is
5590                                                              not generic, omit
5591                                                              lgkmcnt(0).
5592                                                            - However, since LLVM
5593                                                              currently has no
5594                                                              address space on
5595                                                              the fence, it must
5596                                                              conservatively
5597                                                              always be generated
5598                                                              (see comment for
5599                                                              previous fence).
5600                                                            - Could be split into
5601                                                              separate s_waitcnt
5602                                                              vmcnt(0) and
5603                                                              s_waitcnt
5604                                                              lgkmcnt(0) to allow
5605                                                              them to be
5606                                                              independently moved
5607                                                              according to the
5608                                                              following rules.
5609                                                            - s_waitcnt vmcnt(0)
5610                                                              must happen after
5611                                                              any preceding
5612                                                              global/generic load
5613                                                              atomic/atomicrmw
5614                                                              with an equal or
5615                                                              wider sync scope
5616                                                              and memory ordering
5617                                                              stronger than
5618                                                              unordered (this is
5619                                                              termed the
5620                                                              fence-paired-atomic).
5621                                                            - s_waitcnt lgkmcnt(0)
5622                                                              must happen after
5623                                                              any preceding
5624                                                              local/generic load
5625                                                              atomic/atomicrmw
5626                                                              with an equal or
5627                                                              wider sync scope
5628                                                              and memory ordering
5629                                                              stronger than
5630                                                              unordered (this is
5631                                                              termed the
5632                                                              fence-paired-atomic).
5633                                                            - Must happen before
5634                                                              the following
5635                                                              buffer_wbinvl1_vol.
5636                                                            - Ensures that the
5637                                                              fence-paired-atomic
5638                                                              has completed
5639                                                              before invalidating
5640                                                              the
5641                                                              cache. Therefore
5642                                                              any following
5643                                                              locations read must
5644                                                              be no older than
5645                                                              the value read by
5646                                                              the
5647                                                              fence-paired-atomic.
5648
5649                                                          2. buffer_wbinvl1_vol
5650
5651                                                            - Must happen before any
5652                                                              following global/generic
5653                                                              load/load
5654                                                              atomic/store/store
5655                                                              atomic/atomicrmw.
5656                                                            - Ensures that
5657                                                              following loads
5658                                                              will not see stale
5659                                                              global data.
5660
5661      **Release Atomic**
5662      ------------------------------------------------------------------------------------
5663      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5664                                - wavefront    - local
5665                                               - generic
5666      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5667                                               - generic
5668                                                            - If OpenCL, omit.
5669                                                            - Must happen after
5670                                                              any preceding
5671                                                              local/generic
5672                                                              load/store/load
5673                                                              atomic/store
5674                                                              atomic/atomicrmw.
5675                                                            - Must happen before
5676                                                              the following
5677                                                              store.
5678                                                            - Ensures that all
5679                                                              memory operations
5680                                                              to local have
5681                                                              completed before
5682                                                              performing the
5683                                                              store that is being
5684                                                              released.
5685
5686                                                          2. buffer/global/flat_store
5687      store atomic release      - workgroup    - local    1. ds_store
5688      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5689                                - system       - generic     vmcnt(0)
5690
5691                                                            - If OpenCL and
5692                                                              address space is
5693                                                              not generic, omit
5694                                                              lgkmcnt(0).
5695                                                            - Could be split into
5696                                                              separate s_waitcnt
5697                                                              vmcnt(0) and
5698                                                              s_waitcnt
5699                                                              lgkmcnt(0) to allow
5700                                                              them to be
5701                                                              independently moved
5702                                                              according to the
5703                                                              following rules.
5704                                                            - s_waitcnt vmcnt(0)
5705                                                              must happen after
5706                                                              any preceding
5707                                                              global/generic
5708                                                              load/store/load
5709                                                              atomic/store
5710                                                              atomic/atomicrmw.
5711                                                            - s_waitcnt lgkmcnt(0)
5712                                                              must happen after
5713                                                              any preceding
5714                                                              local/generic
5715                                                              load/store/load
5716                                                              atomic/store
5717                                                              atomic/atomicrmw.
5718                                                            - Must happen before
5719                                                              the following
5720                                                              store.
5721                                                            - Ensures that all
5722                                                              memory operations
5723                                                              have
5724                                                              completed before
5725                                                              performing the
5726                                                              store that is being
5727                                                              released.
5728
5729                                                          2. buffer/global/flat_store
5730      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5731                                - wavefront    - local
5732                                               - generic
5733      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5734                                               - generic
5735                                                            - If OpenCL, omit.
5736                                                            - Must happen after
5737                                                              any preceding
5738                                                              local/generic
5739                                                              load/store/load
5740                                                              atomic/store
5741                                                              atomic/atomicrmw.
5742                                                            - Must happen before
5743                                                              the following
5744                                                              atomicrmw.
5745                                                            - Ensures that all
5746                                                              memory operations
5747                                                              to local have
5748                                                              completed before
5749                                                              performing the
5750                                                              atomicrmw that is
5751                                                              being released.
5752
5753                                                          2. buffer/global/flat_atomic
5754      atomicrmw    release      - workgroup    - local    1. ds_atomic
5755      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5756                                - system       - generic     vmcnt(0)
5757
5758                                                            - If OpenCL, omit
5759                                                              lgkmcnt(0).
5760                                                            - Could be split into
5761                                                              separate s_waitcnt
5762                                                              vmcnt(0) and
5763                                                              s_waitcnt
5764                                                              lgkmcnt(0) to allow
5765                                                              them to be
5766                                                              independently moved
5767                                                              according to the
5768                                                              following rules.
5769                                                            - s_waitcnt vmcnt(0)
5770                                                              must happen after
5771                                                              any preceding
5772                                                              global/generic
5773                                                              load/store/load
5774                                                              atomic/store
5775                                                              atomic/atomicrmw.
5776                                                            - s_waitcnt lgkmcnt(0)
5777                                                              must happen after
5778                                                              any preceding
5779                                                              local/generic
5780                                                              load/store/load
5781                                                              atomic/store
5782                                                              atomic/atomicrmw.
5783                                                            - Must happen before
5784                                                              the following
5785                                                              atomicrmw.
5786                                                            - Ensures that all
5787                                                              memory operations
5788                                                              to global and local
5789                                                              have completed
5790                                                              before performing
5791                                                              the atomicrmw that
5792                                                              is being released.
5793
5794                                                          2. buffer/global/flat_atomic
5795      fence        release      - singlethread *none*     *none*
5796                                - wavefront
5797      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5798
5799                                                            - If OpenCL and
5800                                                              address space is
5801                                                              not generic, omit.
5802                                                            - However, since LLVM
5803                                                              currently has no
5804                                                              address space on
5805                                                              the fence, the wait
5806                                                              must conservatively
5807                                                              always be generated.
5808                                                              If the fence had an
5809                                                              address space, it
5810                                                              would be set to the
5811                                                              address space of the
5812                                                              OpenCL fence flag,
5813                                                              or to generic if
5814                                                              both local and
5815                                                              global flags are
5816                                                              specified.
5817                                                            - Must happen after
5818                                                              any preceding
5819                                                              local/generic
5820                                                              load/load
5821                                                              atomic/store/store
5822                                                              atomic/atomicrmw.
5823                                                            - Must happen before
5824                                                              any following store
5825                                                              atomic/atomicrmw
5826                                                              with an equal or
5827                                                              wider sync scope
5828                                                              and memory ordering
5829                                                              stronger than
5830                                                              unordered (this is
5831                                                              termed the
5832                                                              fence-paired-atomic).
5833                                                            - Ensures that all
5834                                                              memory operations
5835                                                              to local have
5836                                                              completed before
5837                                                              performing the
5838                                                              following
5839                                                              fence-paired-atomic.
5840
5841      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5842                                - system                     vmcnt(0)
5843
5844                                                            - If OpenCL and
5845                                                              address space is
5846                                                              not generic, omit
5847                                                              lgkmcnt(0).
5848                                                            - If OpenCL and
5849                                                              address space is
5850                                                              local, omit
5851                                                              vmcnt(0).
5852                                                            - However, since LLVM
5853                                                              currently has no
5854                                                              address space on
5855                                                              the fence, the wait
5856                                                              must conservatively
5857                                                              always be generated.
5858                                                              If the fence had an
5859                                                              address space, it
5860                                                              would be set to the
5861                                                              address space of the
5862                                                              OpenCL fence flag,
5863                                                              or to generic if
5864                                                              both local and
5865                                                              global flags are
5866                                                              specified.
5867                                                            - Could be split into
5868                                                              separate s_waitcnt
5869                                                              vmcnt(0) and
5870                                                              s_waitcnt
5871                                                              lgkmcnt(0) to allow
5872                                                              them to be
5873                                                              independently moved
5874                                                              according to the
5875                                                              following rules.
5876                                                            - s_waitcnt vmcnt(0)
5877                                                              must happen after
5878                                                              any preceding
5879                                                              global/generic
5880                                                              load/store/load
5881                                                              atomic/store
5882                                                              atomic/atomicrmw.
5883                                                            - s_waitcnt lgkmcnt(0)
5884                                                              must happen after
5885                                                              any preceding
5886                                                              local/generic
5887                                                              load/store/load
5888                                                              atomic/store
5889                                                              atomic/atomicrmw.
5890                                                            - Must happen before
5891                                                              any following store
5892                                                              atomic/atomicrmw
5893                                                              with an equal or
5894                                                              wider sync scope
5895                                                              and memory ordering
5896                                                              stronger than
5897                                                              unordered (this is
5898                                                              termed the
5899                                                              fence-paired-atomic).
5900                                                            - Ensures that all
5901                                                              memory operations
5902                                                              have
5903                                                              completed before
5904                                                              performing the
5905                                                              following
5906                                                              fence-paired-atomic.
5907
5908      **Acquire-Release Atomic**
5909      ------------------------------------------------------------------------------------
5910      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5911                                - wavefront    - local
5912                                               - generic
5913      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5914
5915                                                            - If OpenCL, omit.
5916                                                            - Must happen after
5917                                                              any preceding
5918                                                              local/generic
5919                                                              load/store/load
5920                                                              atomic/store
5921                                                              atomic/atomicrmw.
5922                                                            - Must happen before
5923                                                              the following
5924                                                              atomicrmw.
5925                                                            - Ensures that all
5926                                                              memory operations
5927                                                              to local have
5928                                                              completed before
5929                                                              performing the
5930                                                              atomicrmw that is
5931                                                              being released.
5932
5933                                                          2. buffer/global_atomic
5934
5935      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5936                                                          2. s_waitcnt lgkmcnt(0)
5937
5938                                                            - If OpenCL, omit.
5939                                                            - Must happen before
5940                                                              any following
5941                                                              global/generic
5942                                                              load/load
5943                                                              atomic/store/store
5944                                                              atomic/atomicrmw.
5945                                                            - Ensures any
5946                                                              older than the local
5947                                                              atomicrmw value being
5948                                                              older than the local load
5949                                                              atomic value being
5950                                                              acquired.
5951
5952      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5953
5954                                                            - If OpenCL, omit.
5955                                                            - Must happen after
5956                                                              any preceding
5957                                                              local/generic
5958                                                              load/store/load
5959                                                              atomic/store
5960                                                              atomic/atomicrmw.
5961                                                            - Must happen before
5962                                                              the following
5963                                                              atomicrmw.
5964                                                            - Ensures that all
5965                                                              memory operations
5966                                                              to local have
5967                                                              completed before
5968                                                              performing the
5969                                                              atomicrmw that is
5970                                                              being released.
5971
5972                                                          2. flat_atomic
5973                                                          3. s_waitcnt lgkmcnt(0)
5974
5975                                                            - If OpenCL, omit.
5976                                                            - Must happen before
5977                                                              any following
5978                                                              global/generic
5979                                                              load/load
5980                                                              atomic/store/store
5981                                                              atomic/atomicrmw.
5982                                                            - Ensures any
5983                                                              older than a local
5984                                                              atomicrmw value being
5985                                                              older than a local load
5986                                                              atomic value being
5987                                                              acquired.
5988
5989      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5990                                - system                     vmcnt(0)
5991
5992                                                            - If OpenCL, omit
5993                                                              lgkmcnt(0).
5994                                                            - Could be split into
5995                                                              separate s_waitcnt
5996                                                              vmcnt(0) and
5997                                                              s_waitcnt
5998                                                              lgkmcnt(0) to allow
5999                                                              them to be
6000                                                              independently moved
6001                                                              according to the
6002                                                              following rules.
6003                                                            - s_waitcnt vmcnt(0)
6004                                                              must happen after
6005                                                              any preceding
6006                                                              global/generic
6007                                                              load/store/load
6008                                                              atomic/store
6009                                                              atomic/atomicrmw.
6010                                                            - s_waitcnt lgkmcnt(0)
6011                                                              must happen after
6012                                                              any preceding
6013                                                              local/generic
6014                                                              load/store/load
6015                                                              atomic/store
6016                                                              atomic/atomicrmw.
6017                                                            - Must happen before
6018                                                              the following
6019                                                              atomicrmw.
6020                                                            - Ensures that all
6021                                                              memory operations
6022                                                              to global have
6023                                                              completed before
6024                                                              performing the
6025                                                              atomicrmw that is
6026                                                              being released.
6027
6028                                                          2. buffer/global_atomic
6029                                                          3. s_waitcnt vmcnt(0)
6030
6031                                                            - Must happen before
6032                                                              following
6033                                                              buffer_wbinvl1_vol.
6034                                                            - Ensures the
6035                                                              atomicrmw has
6036                                                              completed before
6037                                                              invalidating the
6038                                                              cache.
6039
6040                                                          4. buffer_wbinvl1_vol
6041
6042                                                            - Must happen before
6043                                                              any following
6044                                                              global/generic
6045                                                              load/load
6046                                                              atomic/atomicrmw.
6047                                                            - Ensures that
6048                                                              following loads
6049                                                              will not see stale
6050                                                              global data.
6051
6052      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
6053                                - system                     vmcnt(0)
6054
6055                                                            - If OpenCL, omit
6056                                                              lgkmcnt(0).
6057                                                            - Could be split into
6058                                                              separate s_waitcnt
6059                                                              vmcnt(0) and
6060                                                              s_waitcnt
6061                                                              lgkmcnt(0) to allow
6062                                                              them to be
6063                                                              independently moved
6064                                                              according to the
6065                                                              following rules.
6066                                                            - s_waitcnt vmcnt(0)
6067                                                              must happen after
6068                                                              any preceding
6069                                                              global/generic
6070                                                              load/store/load
6071                                                              atomic/store
6072                                                              atomic/atomicrmw.
6073                                                            - s_waitcnt lgkmcnt(0)
6074                                                              must happen after
6075                                                              any preceding
6076                                                              local/generic
6077                                                              load/store/load
6078                                                              atomic/store
6079                                                              atomic/atomicrmw.
6080                                                            - Must happen before
6081                                                              the following
6082                                                              atomicrmw.
6083                                                            - Ensures that all
6084                                                              memory operations
6085                                                              to global have
6086                                                              completed before
6087                                                              performing the
6088                                                              atomicrmw that is
6089                                                              being released.
6090
6091                                                          2. flat_atomic
6092                                                          3. s_waitcnt vmcnt(0) &
6093                                                             lgkmcnt(0)
6094
6095                                                            - If OpenCL, omit
6096                                                              lgkmcnt(0).
6097                                                            - Must happen before
6098                                                              following
6099                                                              buffer_wbinvl1_vol.
6100                                                            - Ensures the
6101                                                              atomicrmw has
6102                                                              completed before
6103                                                              invalidating the
6104                                                              cache.
6105
6106                                                          4. buffer_wbinvl1_vol
6107
6108                                                            - Must happen before
6109                                                              any following
6110                                                              global/generic
6111                                                              load/load
6112                                                              atomic/atomicrmw.
6113                                                            - Ensures that
6114                                                              following loads
6115                                                              will not see stale
6116                                                              global data.
6117
6118      fence        acq_rel      - singlethread *none*     *none*
6119                                - wavefront
6120      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
6121
6122                                                            - If OpenCL and
6123                                                              address space is
6124                                                              not generic, omit.
6125                                                            - However,
6126                                                              since LLVM
6127                                                              currently has no
6128                                                              address space on
6129                                                              the fence, it must
6130                                                              conservatively
6131                                                              always be generated
6132                                                              (see comment for
6133                                                              previous fence).
6134                                                            - Must happen after
6135                                                              any preceding
6136                                                              local/generic
6137                                                              load/load
6138                                                              atomic/store/store
6139                                                              atomic/atomicrmw.
6140                                                            - Must happen before
6141                                                              any following
6142                                                              global/generic
6143                                                              load/load
6144                                                              atomic/store/store
6145                                                              atomic/atomicrmw.
6146                                                            - Ensures that all
6147                                                              memory operations
6148                                                              to local have
6149                                                              completed before
6150                                                              performing any
6151                                                              following global
6152                                                              memory operations.
6153                                                            - Ensures that the
6154                                                              preceding
6155                                                              local/generic load
6156                                                              atomic/atomicrmw
6157                                                              with an equal or
6158                                                              wider sync scope
6159                                                              and memory ordering
6160                                                              stronger than
6161                                                              unordered (this is
6162                                                              termed the
6163                                                              acquire-fence-paired-atomic)
6164                                                              has completed
6165                                                              before following
6166                                                              global memory
6167                                                              operations. This
6168                                                              satisfies the
6169                                                              requirements of
6170                                                              acquire.
6171                                                            - Ensures that all
6172                                                              previous memory
6173                                                              operations have
6174                                                              completed before a
6175                                                              following
6176                                                              local/generic store
6177                                                              atomic/atomicrmw
6178                                                              with an equal or
6179                                                              wider sync scope
6180                                                              and memory ordering
6181                                                              stronger than
6182                                                              unordered (this is
6183                                                              termed the
6184                                                              release-fence-paired-atomic).
6185                                                              This satisfies the
6186                                                              requirements of
6187                                                              release.
6188
6189      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6190                                - system                     vmcnt(0)
6191
6192                                                            - If OpenCL and
6193                                                              address space is
6194                                                              not generic, omit
6195                                                              lgkmcnt(0).
6196                                                            - However, since LLVM
6197                                                              currently has no
6198                                                              address space on
6199                                                              the fence, it must
6200                                                              conservatively
6201                                                              always be generated
6202                                                              (see comment for
6203                                                              previous fence).
6204                                                            - Could be split into
6205                                                              separate s_waitcnt
6206                                                              vmcnt(0) and
6207                                                              s_waitcnt
6208                                                              lgkmcnt(0) to allow
6209                                                              them to be
6210                                                              independently moved
6211                                                              according to the
6212                                                              following rules.
6213                                                            - s_waitcnt vmcnt(0)
6214                                                              must happen after
6215                                                              any preceding
6216                                                              global/generic
6217                                                              load/store/load
6218                                                              atomic/store
6219                                                              atomic/atomicrmw.
6220                                                            - s_waitcnt lgkmcnt(0)
6221                                                              must happen after
6222                                                              any preceding
6223                                                              local/generic
6224                                                              load/store/load
6225                                                              atomic/store
6226                                                              atomic/atomicrmw.
6227                                                            - Must happen before
6228                                                              the following
6229                                                              buffer_wbinvl1_vol.
6230                                                            - Ensures that the
6231                                                              preceding
6232                                                              global/local/generic
6233                                                              load
6234                                                              atomic/atomicrmw
6235                                                              with an equal or
6236                                                              wider sync scope
6237                                                              and memory ordering
6238                                                              stronger than
6239                                                              unordered (this is
6240                                                              termed the
6241                                                              acquire-fence-paired-atomic)
6242                                                              has completed
6243                                                              before invalidating
6244                                                              the cache. This
6245                                                              satisfies the
6246                                                              requirements of
6247                                                              acquire.
6248                                                            - Ensures that all
6249                                                              previous memory
6250                                                              operations have
6251                                                              completed before a
6252                                                              following
6253                                                              global/local/generic
6254                                                              store
6255                                                              atomic/atomicrmw
6256                                                              with an equal or
6257                                                              wider sync scope
6258                                                              and memory ordering
6259                                                              stronger than
6260                                                              unordered (this is
6261                                                              termed the
6262                                                              release-fence-paired-atomic).
6263                                                              This satisfies the
6264                                                              requirements of
6265                                                              release.
6266
6267                                                          2. buffer_wbinvl1_vol
6268
6269                                                            - Must happen before
6270                                                              any following
6271                                                              global/generic
6272                                                              load/load
6273                                                              atomic/store/store
6274                                                              atomic/atomicrmw.
6275                                                            - Ensures that
6276                                                              following loads
6277                                                              will not see stale
6278                                                              global data. This
6279                                                              satisfies the
6280                                                              requirements of
6281                                                              acquire.
6282
6283      **Sequentially Consistent Atomic**
6284      ------------------------------------------------------------------------------------
6285      load atomic  seq_cst      - singlethread - global   *Same as corresponding
6286                                - wavefront    - local    load atomic acquire,
6287                                               - generic  except must generate
6288                                                          all instructions even
6289                                                          for OpenCL.*
6290      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6291                                               - generic
6292
6293                                                            - Must
6294                                                              happen after
6295                                                              preceding
6296                                                              local/generic load
6297                                                              atomic/store
6298                                                              atomic/atomicrmw
6299                                                              with memory
6300                                                              ordering of seq_cst
6301                                                              and with equal or
6302                                                              wider sync scope.
6303                                                              (Note that seq_cst
6304                                                              fences have their
6305                                                              own s_waitcnt
6306                                                              lgkmcnt(0) and so do
6307                                                              not need to be
6308                                                              considered.)
6309                                                            - Ensures any
6310                                                              preceding
6311                                                              sequentially
6312                                                              consistent local
6313                                                              memory instructions
6314                                                              have completed
6315                                                              before executing
6316                                                              this sequentially
6317                                                              consistent
6318                                                              instruction. This
6319                                                              prevents reordering
6320                                                              a seq_cst store
6321                                                              followed by a
6322                                                              seq_cst load. (Note
6323                                                              that seq_cst is
6324                                                              stronger than
6325                                                              acquire/release as
6326                                                              the reordering of
6327                                                              load acquire
6328                                                              followed by a store
6329                                                              release is
6330                                                              prevented by the
6331                                                              s_waitcnt of
6332                                                              the release, but
6333                                                              there is nothing
6334                                                              preventing a store
6335                                                              release followed by
6336                                                              load acquire from
6337                                                              completing out of
6338                                                              order. The s_waitcnt
6339                                                              could be placed after
6340                                                              the seq_cst store or
6341                                                              before the seq_cst load. We
6342                                                              choose the load to
6343                                                              make the s_waitcnt be
6344                                                              as late as possible
6345                                                              so that the store
6346                                                              may have already
6347                                                              completed.)
6348
6349                                                          2. *Following
6350                                                             instructions same as
6351                                                             corresponding load
6352                                                             atomic acquire,
6353                                                             except must generate
6354                                                             all instructions even
6355                                                             for OpenCL.*
6356      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6357                                                          load atomic acquire,
6358                                                          except must generate
6359                                                          all instructions even
6360                                                          for OpenCL.*
6361
6362      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6363                                - system       - generic     vmcnt(0)
6364
6365                                                            - Could be split into
6366                                                              separate s_waitcnt
6367                                                              vmcnt(0)
6368                                                              and s_waitcnt
6369                                                              lgkmcnt(0) to allow
6370                                                              them to be
6371                                                              independently moved
6372                                                              according to the
6373                                                              following rules.
6374                                                            - s_waitcnt lgkmcnt(0)
6375                                                              must happen after
6376                                                              preceding
6377                                                              local/generic load
6378                                                              atomic/store
6379                                                              atomic/atomicrmw
6380                                                              with memory
6381                                                              ordering of seq_cst
6382                                                              and with equal or
6383                                                              wider sync scope.
6384                                                              (Note that seq_cst
6385                                                              fences have their
6386                                                              own s_waitcnt
6387                                                              lgkmcnt(0) and so do
6388                                                              not need to be
6389                                                              considered.)
6390                                                            - s_waitcnt vmcnt(0)
6391                                                              must happen after
6392                                                              preceding
6393                                                              global/generic load
6394                                                              atomic/store
6395                                                              atomic/atomicrmw
6396                                                              with memory
6397                                                              ordering of seq_cst
6398                                                              and with equal or
6399                                                              wider sync scope.
6400                                                              (Note that seq_cst
6401                                                              fences have their
6402                                                              own s_waitcnt
6403                                                              vmcnt(0) and so do
6404                                                              not need to be
6405                                                              considered.)
6406                                                            - Ensures any
6407                                                              preceding
6408                                                              sequentially
6409                                                              consistent global
6410                                                              memory instructions
6411                                                              have completed
6412                                                              before executing
6413                                                              this sequentially
6414                                                              consistent
6415                                                              instruction. This
6416                                                              prevents reordering
6417                                                              a seq_cst store
6418                                                              followed by a
6419                                                              seq_cst load. (Note
6420                                                              that seq_cst is
6421                                                              stronger than
6422                                                              acquire/release as
6423                                                              the reordering of
6424                                                              load acquire
6425                                                              followed by a store
6426                                                              release is
6427                                                              prevented by the
6428                                                              s_waitcnt of
6429                                                              the release, but
6430                                                              there is nothing
6431                                                              preventing a store
6432                                                              release followed by
6433                                                              load acquire from
6434                                                              completing out of
6435                                                              order. The s_waitcnt
6436                                                              could be placed after
6437                                                              the seq_cst store or
6438                                                              before the seq_cst load. We
6439                                                              choose the load to
6440                                                              make the s_waitcnt be
6441                                                              as late as possible
6442                                                              so that the store
6443                                                              may have already
6444                                                              completed.)
6445
6446                                                          2. *Following
6447                                                             instructions same as
6448                                                             corresponding load
6449                                                             atomic acquire,
6450                                                             except must generate
6451                                                             all instructions even
6452                                                             for OpenCL.*
6453      store atomic seq_cst      - singlethread - global   *Same as corresponding
6454                                - wavefront    - local    store atomic release,
6455                                - workgroup    - generic  except must generate
6456                                - agent                   all instructions even
6457                                - system                  for OpenCL.*
6458      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6459                                - wavefront    - local    atomicrmw acq_rel,
6460                                - workgroup    - generic  except must generate
6461                                - agent                   all instructions even
6462                                - system                  for OpenCL.*
6463      fence        seq_cst      - singlethread *none*     *Same as corresponding
6464                                - wavefront               fence acq_rel,
6465                                - workgroup               except must generate
6466                                - agent                   all instructions even
6467                                - system                  for OpenCL.*
6468      ============ ============ ============== ========== ================================
6469
6470 .. _amdgpu-amdhsa-memory-model-gfx90a:
6471
6472 Memory Model GFX90A
6473 +++++++++++++++++++
6474
6475 For GFX90A:
6476
6477 * Each agent has multiple shader arrays (SA).
6478 * Each SA has multiple compute units (CU).
6479 * Each CU has multiple SIMDs that execute wavefronts.
6480 * The wavefronts for a single work-group are executed in the same CU but may be
6481   executed by different SIMDs. The exception is tgsplit execution mode, in
6482   which the wavefronts may be executed by different SIMDs in different CUs.
6483 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6484   executing on it. The exception is tgsplit execution mode, in which no LDS
6485   is allocated, as wavefronts of the same work-group can be in different CUs.
6486 * All LDS operations of a CU are performed as wavefront wide operations in a
6487   global order and involve no caching. Completion is reported to a wavefront in
6488   execution order.
6489 * The LDS memory has multiple request queues shared by the SIMDs of a
6490   CU. Therefore, the LDS operations performed by different wavefronts of a
6491   work-group can be reordered relative to each other, which can result in
6492   reordering the visibility of vector memory operations with respect to LDS
6493   operations of other wavefronts in the same work-group. A ``s_waitcnt
6494   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6495   vector memory operations between wavefronts of a work-group, but not between
6496   operations performed by the same wavefront.
6497 * The vector memory operations are performed as wavefront wide operations and
6498   completion is reported to a wavefront in execution order. The exception is
6499   that ``flat_load/store/atomic`` instructions can report out of vector memory
6500   order if they access LDS memory, and out of LDS operation order if they access
6501   global memory.
6502 * The vector memory operations access a single vector L1 cache shared by all
6503   SIMDs of a CU. Therefore:
6504
6505   * No special action is required for coherence between the lanes of a single
6506     wavefront.
6507
6508   * No special action is required for coherence between wavefronts in the same
6509     work-group since they execute on the same CU. The exception is tgsplit
6510     execution mode, in which wavefronts of the same work-group can be in
6511     different CUs, so a ``buffer_wbinvl1_vol`` is required as described in
6512     the following item.
6513
6514   * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6515     executing in different work-groups as they may be executing on different
6516     CUs.
6517
6518 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6519   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6520   scalar operations are used in a restricted way so do not impact the memory
6521   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6522 * The vector and scalar memory operations use an L2 cache shared by all CUs on
6523   the same agent.
6524
6525   * The L2 cache has independent channels to service disjoint ranges of virtual
6526     addresses.
6527   * Each CU has a separate request queue per channel. Therefore, the vector and
6528     scalar memory operations performed by wavefronts executing in different
6529     work-groups (which may be executing on different CUs), or the same
6530     work-group if executing in tgsplit mode, of an agent can be reordered
6531     relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6532     synchronization between vector memory operations of different CUs. It
6533     ensures a previous vector memory operation has completed before executing a
6534     subsequent vector memory or LDS operation and so can be used to meet the
6535     requirements of acquire and release.
6536   * The L2 cache of one agent can be kept coherent with other agents by:
6537     using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6538     C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6539     the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6540
6541     * Any local memory cache lines will be automatically invalidated by writes
6542       from CUs associated with other L2 caches, or writes from the CPU, due to
6543       the cache probe caused by coherent requests. Coherent requests are caused
6544       by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6545       XGMI, and by PCIe requests that are configured to be coherent requests.
6546     * XGMI accesses from the CPU to local memory may be cached on the CPU.
6547       Subsequent access from the GPU will automatically invalidate or writeback
6548       the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6549     * Since all work-groups on the same agent share the same L2, no L2
6550       invalidation or writeback is required for coherence.
6551     * To ensure coherence of local and remote memory writes of work-groups in
6552       different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6553       cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6554       (used for remote coarse grain memory). Note that MTYPE CC (used for local
6555       fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6556       remote fine grain memory) bypasses the L2, so both will never result in
6557       dirty L2 cache lines.
6558     * To ensure coherence of local and remote memory reads of work-groups in
6559       different agents a ``buffer_invl2`` is required. It will invalidate L2
6560       cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6561       MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6562       coarse grain memory) cause local reads to be invalidated by remote writes
6563       with the PTE C-bit so these cache lines are not invalidated. Note that
6564       MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6565       never result in L2 cache lines that need to be invalidated.
6566
6567   * PCIe access from the GPU to the CPU memory is kept coherent by using the
6568     MTYPE UC (uncached) which bypasses the L2.
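
As an illustration of how these hardware properties surface in generated code,
the following sketch shows a system-scope acquire load from the global address
space, following the corresponding row of the code sequence table below. The
register numbers and the choice of ``global_load_dword`` are assumptions made
for the example; the actual instructions depend on the access size, the
addressing form, and register allocation.

.. code-block:: nasm

  ; load atomic acquire, system scope, global address space (illustrative)
  global_load_dword v0, v[2:3], off glc  ; glc=1 as required for the atomic load
  s_waitcnt vmcnt(0)                     ; load must complete before the invalidates
  buffer_invl2                           ; invalidate L2 lines of MTYPE NC
  buffer_wbinvl1_vol                     ; invalidate volatile L1 lines

At agent scope the same sequence applies without the ``buffer_invl2``, since
all work-groups on one agent share the same L2.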
6569
6570 Scalar memory operations are only used to access memory that is proven to not
6571 change during the execution of the kernel dispatch. This includes constant
6572 address space and global address space for program scope ``const`` variables.
6573 Therefore, the kernel machine code does not have to maintain the scalar cache to
6574 ensure it is coherent with the vector caches. The scalar and vector caches are
6575 invalidated between kernel dispatches by CP since constant address space data
6576 may change between kernel dispatch executions. See
6577 :ref:`amdgpu-amdhsa-memory-spaces`.
6578
6579 The one exception is if scalar writes are used to spill SGPR registers. In this
6580 case the AMDGPU backend ensures the memory location used to spill is never
6581 accessed by vector memory operations at the same time. If scalar writes are used
6582 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6583 return since the locations may be used for vector memory instructions by a
6584 future wavefront that uses the same scratch area, or a function call that
6585 creates a frame at the same address, respectively. There is no need for a
6586 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
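
A minimal sketch of the resulting kernel epilogue is shown below, assuming the
kernel spilled SGPRs using scalar writes. The spill instructions themselves are
elided because their exact form depends on the frame layout.

.. code-block:: nasm

  ; ... kernel body that spilled SGPRs using scalar writes ...
  s_dcache_wb   ; write back scalar cache lines holding the spilled values
  s_endpgm      ; the scratch area may now be reused by a future wavefront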
6587
6588 For kernarg backing memory:
6589
6590 * CP invalidates the L1 cache at the start of each kernel dispatch.
6591 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6592   memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6593   cache. This also causes it to be treated as non-volatile and so is not
6594   invalidated by ``*_vol``.
6595 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6596   so the L2 cache will be coherent with the CPU and other agents.
6597
6598 Scratch backing memory (which is used for the private address space) is accessed
6599 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6600 only accessed by a single thread, and is always write-before-read, there is
6601 never a need to invalidate these entries from the L1 cache. Hence all cache
6602 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6603
6604 The code sequences used to implement the memory model for GFX90A are defined
6605 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6606
6607   .. table:: AMDHSA Memory Model Code Sequences GFX90A
6608      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6609
6610      ============ ============ ============== ========== ================================
6611      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6612                   Ordering     Sync Scope     Address    GFX90A
6613                                               Space
6614      ============ ============ ============== ========== ================================
6615      **Non-Atomic**
6616      ------------------------------------------------------------------------------------
6617      load         *none*       *none*         - global   - !volatile & !nontemporal
6618                                               - generic
6619                                               - private    1. buffer/global/flat_load
6620                                               - constant
6621                                                          - !volatile & nontemporal
6622
6623                                                            1. buffer/global/flat_load
6624                                                               glc=1 slc=1
6625
6626                                                          - volatile
6627
6628                                                            1. buffer/global/flat_load
6629                                                               glc=1
6630                                                            2. s_waitcnt vmcnt(0)
6631
6632                                                             - Must happen before
6633                                                               any following volatile
6634                                                               global/generic
6635                                                               load/store.
6636                                                             - Ensures that
6637                                                               volatile
6638                                                               operations to
6639                                                               different
6640                                                               addresses will not
6641                                                               be reordered by
6642                                                               hardware.
6643
6644      load         *none*       *none*         - local    1. ds_load
6645      store        *none*       *none*         - global   - !volatile & !nontemporal
6646                                               - generic
6647                                               - private    1. buffer/global/flat_store
6648                                               - constant
6649                                                          - !volatile & nontemporal
6650
6651                                                            1. buffer/global/flat_store
6652                                                               glc=1 slc=1
6653
6654                                                          - volatile
6655
6656                                                            1. buffer/global/flat_store
6657                                                            2. s_waitcnt vmcnt(0)
6658
6659                                                             - Must happen before
6660                                                               any following volatile
6661                                                               global/generic
6662                                                               load/store.
6663                                                             - Ensures that
6664                                                               volatile
6665                                                               operations to
6666                                                               different
6667                                                               addresses will not
6668                                                               be reordered by
6669                                                               hardware.
6670
6671      store        *none*       *none*         - local    1. ds_store
6672      **Unordered Atomic**
6673      ------------------------------------------------------------------------------------
6674      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6675      store atomic unordered    *any*          *any*      *Same as non-atomic*.
6676      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6677      **Monotonic Atomic**
6678      ------------------------------------------------------------------------------------
6679      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6680                                - wavefront    - generic
6681      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6682                                               - generic     glc=1
6683
6684                                                            - If not TgSplit execution
6685                                                              mode, omit glc=1.
6686
6687      load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6688                                - wavefront               local address space cannot
6689                                - workgroup               be used.*
6690
6691                                                          1. ds_load
6692      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6693                                               - generic     glc=1
6694      load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6695                                               - generic     glc=1
6696      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6697                                - wavefront    - generic
6698                                - workgroup
6699                                - agent
6700      store atomic monotonic    - system       - global   1. buffer/global/flat_store
6701                                               - generic
6702      store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6703                                - wavefront               local address space cannot
6704                                - workgroup               be used.*
6705
6706                                                          1. ds_store
6707      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6708                                - wavefront    - generic
6709                                - workgroup
6710                                - agent
6711      atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6712                                               - generic
6713      atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6714                                - wavefront               local address space cannot
6715                                - workgroup               be used.*
6716
6717                                                          1. ds_atomic
6718      **Acquire Atomic**
6719      ------------------------------------------------------------------------------------
6720      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6721                                - wavefront    - local
6722                                               - generic
6723      load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6724
6725                                                            - If not TgSplit execution
6726                                                              mode, omit glc=1.
6727
6728                                                          2. s_waitcnt vmcnt(0)
6729
6730                                                            - If not TgSplit execution
6731                                                              mode, omit.
6732                                                            - Must happen before the
6733                                                              following buffer_wbinvl1_vol.
6734
6735                                                          3. buffer_wbinvl1_vol
6736
6737                                                            - If not TgSplit execution
6738                                                              mode, omit.
6739                                                            - Must happen before
6740                                                              any following
6741                                                              global/generic
6742                                                              load/load
6743                                                              atomic/store/store
6744                                                              atomic/atomicrmw.
6745                                                            - Ensures that
6746                                                              following
6747                                                              loads will not see
6748                                                              stale data.
6749
6750      load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6751                                                          local address space cannot
6752                                                          be used.*
6753
6754                                                          1. ds_load
6755                                                          2. s_waitcnt lgkmcnt(0)
6756
6757                                                            - If OpenCL, omit.
6758                                                            - Must happen before
6759                                                              any following
6760                                                              global/generic
6761                                                              load/load
6762                                                              atomic/store/store
6763                                                              atomic/atomicrmw.
6764                                                            - Ensures any
6765                                                              following global
6766                                                              data read is no
6767                                                              older than the local load
6768                                                              atomic value being
6769                                                              acquired.
6770
6771      load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6772
6773                                                            - If not TgSplit execution
6774                                                              mode, omit glc=1.
6775
6776                                                          2. s_waitcnt lgkm/vmcnt(0)
6777
6778                                                            - Use lgkmcnt(0) if not
6779                                                              TgSplit execution mode
6780                                                              and vmcnt(0) if TgSplit
6781                                                              execution mode.
6782                                                            - If OpenCL, omit lgkmcnt(0).
6783                                                            - Must happen before
6784                                                              the following
6785                                                              buffer_wbinvl1_vol and any
6786                                                              following global/generic
6787                                                              load/load
6788                                                              atomic/store/store
6789                                                              atomic/atomicrmw.
6790                                                            - Ensures any
6791                                                              following global
6792                                                              data read is no
6793                                                              older than a local load
6794                                                              atomic value being
6795                                                              acquired.
6796
6797                                                          3. buffer_wbinvl1_vol
6798
6799                                                            - If not TgSplit execution
6800                                                              mode, omit.
6801                                                            - Ensures that
6802                                                              following
6803                                                              loads will not see
6804                                                              stale data.
6805
6806      load atomic  acquire      - agent        - global   1. buffer/global_load
6807                                                             glc=1
6808                                                          2. s_waitcnt vmcnt(0)
6809
6810                                                            - Must happen before
6811                                                              following
6812                                                              buffer_wbinvl1_vol.
6813                                                            - Ensures the load
6814                                                              has completed
6815                                                              before invalidating
6816                                                              the cache.
6817
6818                                                          3. buffer_wbinvl1_vol
6819
6820                                                            - Must happen before
6821                                                              any following
6822                                                              global/generic
6823                                                              load/load
6824                                                              atomic/atomicrmw.
6825                                                            - Ensures that
6826                                                              following
6827                                                              loads will not see
6828                                                              stale global data.
6829
6830      load atomic  acquire      - system       - global   1. buffer/global/flat_load
6831                                                             glc=1
6832                                                          2. s_waitcnt vmcnt(0)
6833
6834                                                            - Must happen before
6835                                                              following buffer_invl2 and
6836                                                              buffer_wbinvl1_vol.
6837                                                            - Ensures the load
6838                                                              has completed
6839                                                              before invalidating
6840                                                              the cache.
6841
6842                                                          3. buffer_invl2;
6843                                                             buffer_wbinvl1_vol
6844
6845                                                            - Must happen before
6846                                                              any following
6847                                                              global/generic
6848                                                              load/load
6849                                                              atomic/atomicrmw.
6850                                                            - Ensures that
6851                                                              following
6852                                                              loads will not see
6853                                                              stale L1 global data,
6854                                                              nor see stale L2 MTYPE
6855                                                              NC global data.
6856                                                              MTYPE RW and CC memory will
6857                                                              never be stale in L2 due to
6858                                                              the memory probes.
6859
6860      load atomic  acquire      - agent        - generic  1. flat_load glc=1
6861                                                          2. s_waitcnt vmcnt(0) &
6862                                                             lgkmcnt(0)
6863
6864                                                            - If TgSplit execution mode,
6865                                                              omit lgkmcnt(0).
6866                                                            - If OpenCL, omit
6867                                                              lgkmcnt(0).
6868                                                            - Must happen before
6869                                                              following
6870                                                              buffer_wbinvl1_vol.
6871                                                            - Ensures the flat_load
6872                                                              has completed
6873                                                              before invalidating
6874                                                              the cache.
6875
6876                                                          3. buffer_wbinvl1_vol
6877
6878                                                            - Must happen before
6879                                                              any following
6880                                                              global/generic
6881                                                              load/load
6882                                                              atomic/atomicrmw.
6883                                                            - Ensures that
6884                                                              following loads
6885                                                              will not see stale
6886                                                              global data.
6887
6888      load atomic  acquire      - system       - generic  1. flat_load glc=1
6889                                                          2. s_waitcnt vmcnt(0) &
6890                                                             lgkmcnt(0)
6891
6892                                                            - If TgSplit execution mode,
6893                                                              omit lgkmcnt(0).
6894                                                            - If OpenCL, omit
6895                                                              lgkmcnt(0).
6896                                                            - Must happen before
6897                                                              following
6898                                                              buffer_invl2 and
6899                                                              buffer_wbinvl1_vol.
6900                                                            - Ensures the flat_load
6901                                                              has completed
6902                                                              before invalidating
6903                                                              the caches.
6904
6905                                                          3. buffer_invl2;
6906                                                             buffer_wbinvl1_vol
6907
6908                                                            - Must happen before
6909                                                              any following
6910                                                              global/generic
6911                                                              load/load
6912                                                              atomic/atomicrmw.
6913                                                            - Ensures that
6914                                                              following
6915                                                              loads will not see
6916                                                              stale L1 global data,
6917                                                              nor see stale L2 MTYPE
6918                                                              NC global data.
6919                                                              MTYPE RW and CC memory will
6920                                                              never be stale in L2 due to
6921                                                              the memory probes.
6922
6923      atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6924                                - wavefront    - generic
6925      atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6926                                - wavefront               local address space cannot
6927                                                          be used.*
6928
6929                                                          1. ds_atomic
6930      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6931                                                          2. s_waitcnt vmcnt(0)
6932
6933                                                            - If not TgSplit execution
6934                                                              mode, omit.
6935                                                            - Must happen before the
6936                                                              following buffer_wbinvl1_vol.
6937                                                            - Ensures the atomicrmw
6938                                                              has completed
6939                                                              before invalidating
6940                                                              the cache.
6941
6942                                                          3. buffer_wbinvl1_vol
6943
6944                                                            - If not TgSplit execution
6945                                                              mode, omit.
6946                                                            - Must happen before
6947                                                              any following
6948                                                              global/generic
6949                                                              load/load
6950                                                              atomic/atomicrmw.
6951                                                            - Ensures that
6952                                                              following loads
6953                                                              will not see stale
6954                                                              global data.
6955
6956      atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6957                                                          local address space cannot
6958                                                          be used.*
6959
6960                                                          1. ds_atomic
6961                                                          2. s_waitcnt lgkmcnt(0)
6962
6963                                                            - If OpenCL, omit.
6964                                                            - Must happen before
6965                                                              any following
6966                                                              global/generic
6967                                                              load/load
6968                                                              atomic/store/store
6969                                                              atomic/atomicrmw.
6970                                                            - Ensures any
6971                                                              following global
6972                                                              data read is no
6973                                                              older than the local
6974                                                              atomicrmw value
6975                                                              being acquired.
6976
6977      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6978                                                          2. s_waitcnt lgkm/vmcnt(0)
6979
6980                                                            - Use lgkmcnt(0) if not
6981                                                              TgSplit execution mode
6982                                                              and vmcnt(0) if TgSplit
6983                                                              execution mode.
6984                                                            - If OpenCL, omit lgkmcnt(0).
6985                                                            - Must happen before
6986                                                              the following
6987                                                              buffer_wbinvl1_vol and
6988                                                              any following
6989                                                              global/generic
6990                                                              load/load
6991                                                              atomic/store/store
6992                                                              atomic/atomicrmw.
6993                                                            - Ensures any
6994                                                              following global
6995                                                              data read is no
6996                                                              older than a local
6997                                                              atomicrmw value
6998                                                              being acquired.
6999
7000                                                          3. buffer_wbinvl1_vol
7001
7002                                                            - If not TgSplit execution
7003                                                              mode, omit.
7004                                                            - Ensures that
7005                                                              following
7006                                                              loads will not see
7007                                                              stale data.
7008
7009      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
7010                                                          2. s_waitcnt vmcnt(0)
7011
7012                                                            - Must happen before
7013                                                              following
7014                                                              buffer_wbinvl1_vol.
7015                                                            - Ensures the
7016                                                              atomicrmw has
7017                                                              completed before
7018                                                              invalidating the
7019                                                              cache.
7020
7021                                                          3. buffer_wbinvl1_vol
7022
7023                                                            - Must happen before
7024                                                              any following
7025                                                              global/generic
7026                                                              load/load
7027                                                              atomic/atomicrmw.
7028                                                            - Ensures that
7029                                                              following loads
7030                                                              will not see stale
7031                                                              global data.
7032
7033      atomicrmw    acquire      - system       - global   1. buffer/global_atomic
7034                                                          2. s_waitcnt vmcnt(0)
7035
7036                                                            - Must happen before
7037                                                              following buffer_invl2 and
7038                                                              buffer_wbinvl1_vol.
7039                                                            - Ensures the
7040                                                              atomicrmw has
7041                                                              completed before
7042                                                              invalidating the
7043                                                              caches.
7044
7045                                                          3. buffer_invl2;
7046                                                             buffer_wbinvl1_vol
7047
7048                                                            - Must happen before
7049                                                              any following
7050                                                              global/generic
7051                                                              load/load
7052                                                              atomic/atomicrmw.
7053                                                            - Ensures that
7054                                                              following
7055                                                              loads will not see
7056                                                              stale L1 global data,
7057                                                              nor see stale L2 MTYPE
7058                                                              NC global data.
7059                                                              MTYPE RW and CC memory will
7060                                                              never be stale in L2 due to
7061                                                              the memory probes.
7062
7063      atomicrmw    acquire      - agent        - generic  1. flat_atomic
7064                                                          2. s_waitcnt vmcnt(0) &
7065                                                             lgkmcnt(0)
7066
7067                                                            - If TgSplit execution mode,
7068                                                              omit lgkmcnt(0).
7069                                                            - If OpenCL, omit
7070                                                              lgkmcnt(0).
7071                                                            - Must happen before
7072                                                              following
7073                                                              buffer_wbinvl1_vol.
7074                                                            - Ensures the
7075                                                              atomicrmw has
7076                                                              completed before
7077                                                              invalidating the
7078                                                              cache.
7079
7080                                                          3. buffer_wbinvl1_vol
7081
7082                                                            - Must happen before
7083                                                              any following
7084                                                              global/generic
7085                                                              load/load
7086                                                              atomic/atomicrmw.
7087                                                            - Ensures that
7088                                                              following loads
7089                                                              will not see stale
7090                                                              global data.
7091
7092      atomicrmw    acquire      - system       - generic  1. flat_atomic
7093                                                          2. s_waitcnt vmcnt(0) &
7094                                                             lgkmcnt(0)
7095
7096                                                            - If TgSplit execution mode,
7097                                                              omit lgkmcnt(0).
7098                                                            - If OpenCL, omit
7099                                                              lgkmcnt(0).
7100                                                            - Must happen before
7101                                                              following
7102                                                              buffer_invl2 and
7103                                                              buffer_wbinvl1_vol.
7104                                                            - Ensures the
7105                                                              atomicrmw has
7106                                                              completed before
7107                                                              invalidating the
7108                                                              caches.
7109
7110                                                          3. buffer_invl2;
7111                                                             buffer_wbinvl1_vol
7112
7113                                                            - Must happen before
7114                                                              any following
7115                                                              global/generic
7116                                                              load/load
7117                                                              atomic/atomicrmw.
7118                                                            - Ensures that
7119                                                              following
7120                                                              loads will not see
7121                                                              stale L1 global data,
7122                                                              nor see stale L2 MTYPE
7123                                                              NC global data.
7124                                                              MTYPE RW and CC memory will
7125                                                              never be stale in L2 due to
7126                                                              the memory probes.
7127
7128      fence        acquire      - singlethread *none*     *none*
7129                                - wavefront
7130      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7131
7132                                                            - Use lgkmcnt(0) if not
7133                                                              TgSplit execution mode
7134                                                              and vmcnt(0) if TgSplit
7135                                                              execution mode.
7136                                                            - If OpenCL and
7137                                                              address space is
7138                                                              not generic, omit
7139                                                              lgkmcnt(0).
7140                                                            - If OpenCL and
7141                                                              address space is
7142                                                              local, omit
7143                                                              vmcnt(0).
7144                                                            - However, since LLVM
7145                                                              currently has no
7146                                                              address space on
7147                                                              the fence, the waits
7148                                                              must conservatively
7149                                                              always be generated.
7150                                                              If the fence had an
7151                                                              address space, set
7152                                                              it to the address
7153                                                              space of the OpenCL
7154                                                              fence flag, or to
7155                                                              generic if both
7156                                                              local and global
7157                                                              flags are
7158                                                              specified.
7159                                                            - s_waitcnt vmcnt(0)
7160                                                              must happen after
7161                                                              any preceding
7162                                                              global/generic load
7163                                                              atomic/
7164                                                              atomicrmw
7165                                                              with an equal or
7166                                                              wider sync scope
7167                                                              and memory ordering
7168                                                              stronger than
7169                                                              unordered (this is
7170                                                              termed the
7171                                                              fence-paired-atomic).
7172                                                            - s_waitcnt lgkmcnt(0)
7173                                                              must happen after
7174                                                              any preceding
7175                                                              local/generic load
7176                                                              atomic/atomicrmw
7177                                                              with an equal or
7178                                                              wider sync scope
7179                                                              and memory ordering
7180                                                              stronger than
7181                                                              unordered (this is
7182                                                              termed the
7183                                                              fence-paired-atomic).
7184                                                            - Must happen before
7185                                                              the following
7186                                                              buffer_wbinvl1_vol and
7187                                                              any following
7188                                                              global/generic
7189                                                              load/load
7190                                                              atomic/store/store
7191                                                              atomic/atomicrmw.
7192                                                            - Ensures any
7193                                                              following global
7194                                                              data read is no
7195                                                              older than the
7196                                                              value read by the
7197                                                              fence-paired-atomic.
7198
7199                                                          2. buffer_wbinvl1_vol
7200
7201                                                            - If not TgSplit execution
7202                                                              mode, omit.
7203                                                            - Ensures that
7204                                                              following
7205                                                              loads will not see
7206                                                              stale data.
7207
7208      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7209                                                             vmcnt(0)
7210
7211                                                            - If TgSplit execution mode,
7212                                                              omit lgkmcnt(0).
7213                                                            - If OpenCL and
7214                                                              address space is
7215                                                              not generic, omit
7216                                                              lgkmcnt(0).
7217                                                            - However, since LLVM
7218                                                              currently has no
7219                                                              address space on
7220                                                              the fence, the waits
7221                                                              must conservatively
7222                                                              always be generated
7223                                                              (see comment for
7224                                                              previous fence).
7225                                                            - Could be split into
7226                                                              separate s_waitcnt
7227                                                              vmcnt(0) and
7228                                                              s_waitcnt
7229                                                              lgkmcnt(0) to allow
7230                                                              them to be
7231                                                              independently moved
7232                                                              according to the
7233                                                              following rules.
7234                                                            - s_waitcnt vmcnt(0)
7235                                                              must happen after
7236                                                              any preceding
7237                                                              global/generic load
7238                                                              atomic/atomicrmw
7239                                                              with an equal or
7240                                                              wider sync scope
7241                                                              and memory ordering
7242                                                              stronger than
7243                                                              unordered (this is
7244                                                              termed the
7245                                                              fence-paired-atomic).
7246                                                            - s_waitcnt lgkmcnt(0)
7247                                                              must happen after
7248                                                              any preceding
7249                                                              local/generic load
7250                                                              atomic/atomicrmw
7251                                                              with an equal or
7252                                                              wider sync scope
7253                                                              and memory ordering
7254                                                              stronger than
7255                                                              unordered (this is
7256                                                              termed the
7257                                                              fence-paired-atomic).
7258                                                            - Must happen before
7259                                                              the following
7260                                                              buffer_wbinvl1_vol.
7261                                                            - Ensures that the
7262                                                              fence-paired-atomic
7263                                                              has completed
7264                                                              before invalidating
7265                                                              the
7266                                                              cache. Therefore
7267                                                              any following
7268                                                              locations read must
7269                                                              be no older than
7270                                                              the value read by
7271                                                              the
7272                                                              fence-paired-atomic.
7273
7274                                                          2. buffer_wbinvl1_vol
7275
7276                                                            - Must happen before any
7277                                                              following global/generic
7278                                                              load/load
7279                                                              atomic/store/store
7280                                                              atomic/atomicrmw.
7281                                                            - Ensures that
7282                                                              following loads
7283                                                              will not see stale
7284                                                              global data.
7285
7286      fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
7287                                                             vmcnt(0)
7288
7289                                                            - If TgSplit execution mode,
7290                                                              omit lgkmcnt(0).
7291                                                            - If OpenCL and
7292                                                              address space is
7293                                                              not generic, omit
7294                                                              lgkmcnt(0).
7295                                                            - However, since LLVM
7296                                                              currently has no
7297                                                              address space on
7298                                                              the fence, the waits
7299                                                              must conservatively
7300                                                              always be generated
7301                                                              (see comment for
7302                                                              previous fence).
7303                                                            - Could be split into
7304                                                              separate s_waitcnt
7305                                                              vmcnt(0) and
7306                                                              s_waitcnt
7307                                                              lgkmcnt(0) to allow
7308                                                              them to be
7309                                                              independently moved
7310                                                              according to the
7311                                                              following rules.
7312                                                            - s_waitcnt vmcnt(0)
7313                                                              must happen after
7314                                                              any preceding
7315                                                              global/generic load
7316                                                              atomic/atomicrmw
7317                                                              with an equal or
7318                                                              wider sync scope
7319                                                              and memory ordering
7320                                                              stronger than
7321                                                              unordered (this is
7322                                                              termed the
7323                                                              fence-paired-atomic).
7324                                                            - s_waitcnt lgkmcnt(0)
7325                                                              must happen after
7326                                                              any preceding
7327                                                              local/generic load
7328                                                              atomic/atomicrmw
7329                                                              with an equal or
7330                                                              wider sync scope
7331                                                              and memory ordering
7332                                                              stronger than
7333                                                              unordered (this is
7334                                                              termed the
7335                                                              fence-paired-atomic).
7336                                                            - Must happen before
7337                                                              the following buffer_invl2 and
7338                                                              buffer_wbinvl1_vol.
7339                                                            - Ensures that the
7340                                                              fence-paired-atomic
7341                                                              has completed
7342                                                              before invalidating
7343                                                              the
7344                                                              cache. Therefore
7345                                                              any following
7346                                                              locations read must
7347                                                              be no older than
7348                                                              the value read by
7349                                                              the
7350                                                              fence-paired-atomic.
7351
7352                                                          2. buffer_invl2;
7353                                                             buffer_wbinvl1_vol
7354
7355                                                            - Must happen before any
7356                                                              following global/generic
7357                                                              load/load
7358                                                              atomic/store/store
7359                                                              atomic/atomicrmw.
7360                                                            - Ensures that
7361                                                              following
7362                                                              loads will not see
7363                                                              stale L1 global data,
7364                                                              nor see stale L2 MTYPE
7365                                                              NC global data.
7366                                                              MTYPE RW and CC memory will
7367                                                              never be stale in L2 due to
7368                                                              the memory probes.
7369      **Release Atomic**
7370      ------------------------------------------------------------------------------------
7371      store atomic release      - singlethread - global   1. buffer/global/flat_store
7372                                - wavefront    - generic
7373      store atomic release      - singlethread - local    *If TgSplit execution mode,
7374                                - wavefront               local address space cannot
7375                                                          be used.*
7376
7377                                                          1. ds_store
7378      store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7379                                               - generic
7380                                                            - Use lgkmcnt(0) if not
7381                                                              TgSplit execution mode
7382                                                              and vmcnt(0) if TgSplit
7383                                                              execution mode.
7384                                                            - If OpenCL, omit lgkmcnt(0).
7385                                                            - s_waitcnt vmcnt(0)
7386                                                              must happen after
7387                                                              any preceding
7388                                                              global/generic load/store/
7389                                                              load atomic/store atomic/
7390                                                              atomicrmw.
7391                                                            - s_waitcnt lgkmcnt(0)
7392                                                              must happen after
7393                                                              any preceding
7394                                                              local/generic
7395                                                              load/store/load
7396                                                              atomic/store
7397                                                              atomic/atomicrmw.
7398                                                            - Must happen before
7399                                                              the following
7400                                                              store.
7401                                                            - Ensures that all
7402                                                              memory operations
7403                                                              have
7404                                                              completed before
7405                                                              performing the
7406                                                              store that is being
7407                                                              released.
7408
7409                                                          2. buffer/global/flat_store
7410      store atomic release      - workgroup    - local    *If TgSplit execution mode,
7411                                                          local address space cannot
7412                                                          be used.*
7413
7414                                                          1. ds_store
7415      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7416                                               - generic     vmcnt(0)
7417
7418                                                            - If TgSplit execution mode,
7419                                                              omit lgkmcnt(0).
7420                                                            - If OpenCL and
7421                                                              address space is
7422                                                              not generic, omit
7423                                                              lgkmcnt(0).
7424                                                            - Could be split into
7425                                                              separate s_waitcnt
7426                                                              vmcnt(0) and
7427                                                              s_waitcnt
7428                                                              lgkmcnt(0) to allow
7429                                                              them to be
7430                                                              independently moved
7431                                                              according to the
7432                                                              following rules.
7433                                                            - s_waitcnt vmcnt(0)
7434                                                              must happen after
7435                                                              any preceding
7436                                                              global/generic
7437                                                              load/store/load
7438                                                              atomic/store
7439                                                              atomic/atomicrmw.
7440                                                            - s_waitcnt lgkmcnt(0)
7441                                                              must happen after
7442                                                              any preceding
7443                                                              local/generic
7444                                                              load/store/load
7445                                                              atomic/store
7446                                                              atomic/atomicrmw.
7447                                                            - Must happen before
7448                                                              the following
7449                                                              store.
7450                                                            - Ensures that all
7451                                                              memory operations
7452                                                              have
7453                                                              completed before
7454                                                              performing the
7455                                                              store that is being
7456                                                              released.
7457
7458                                                          2. buffer/global/flat_store
7459      store atomic release      - system       - global   1. buffer_wbl2
7460                                               - generic
7461                                                            - Must happen before
7462                                                              following s_waitcnt.
7463                                                            - Performs L2 writeback to
7464                                                              ensure previous
7465                                                              global/generic
7466                                                              store/atomicrmw are
7467                                                              visible at system scope.
7468
7469                                                          2. s_waitcnt lgkmcnt(0) &
7470                                                             vmcnt(0)
7471
7472                                                            - If TgSplit execution mode,
7473                                                              omit lgkmcnt(0).
7474                                                            - If OpenCL and
7475                                                              address space is
7476                                                              not generic, omit
7477                                                              lgkmcnt(0).
7478                                                            - Could be split into
7479                                                              separate s_waitcnt
7480                                                              vmcnt(0) and
7481                                                              s_waitcnt
7482                                                              lgkmcnt(0) to allow
7483                                                              them to be
7484                                                              independently moved
7485                                                              according to the
7486                                                              following rules.
7487                                                            - s_waitcnt vmcnt(0)
7488                                                              must happen after any
7489                                                              preceding
7490                                                              global/generic
7491                                                              load/store/load
7492                                                              atomic/store
7493                                                              atomic/atomicrmw.
7494                                                            - s_waitcnt lgkmcnt(0)
7495                                                              must happen after any
7496                                                              preceding
7497                                                              local/generic
7498                                                              load/store/load
7499                                                              atomic/store
7500                                                              atomic/atomicrmw.
7501                                                            - Must happen before
7502                                                              the following
7503                                                              store.
7504                                                            - Ensures that all
7505                                                              memory operations
7506                                                              to memory and the L2
7507                                                              writeback have
7508                                                              completed before
7509                                                              performing the
7510                                                              store that is being
7511                                                              released.
7512
7513                                                          3. buffer/global/flat_store
7514      atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7515                                - wavefront    - generic
7516      atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7517                                - wavefront               local address space cannot
7518                                                          be used.*
7519
7520                                                          1. ds_atomic
7521      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7522                                               - generic
7523                                                            - Use lgkmcnt(0) if not
7524                                                              TgSplit execution mode
7525                                                              and vmcnt(0) if TgSplit
7526                                                              execution mode.
7527                                                            - If OpenCL, omit
7528                                                              lgkmcnt(0).
7529                                                            - s_waitcnt vmcnt(0)
7530                                                              must happen after
7531                                                              any preceding
7532                                                              global/generic load/store/
7533                                                              load atomic/store atomic/
7534                                                              atomicrmw.
7535                                                            - s_waitcnt lgkmcnt(0)
7536                                                              must happen after
7537                                                              any preceding
7538                                                              local/generic
7539                                                              load/store/load
7540                                                              atomic/store
7541                                                              atomic/atomicrmw.
7542                                                            - Must happen before
7543                                                              the following
7544                                                              atomicrmw.
7545                                                            - Ensures that all
7546                                                              memory operations
7547                                                              have
7548                                                              completed before
7549                                                              performing the
7550                                                              atomicrmw that is
7551                                                              being released.
7552
7553                                                          2. buffer/global/flat_atomic
7554      atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7555                                                          local address space cannot
7556                                                          be used.*
7557
7558                                                          1. ds_atomic
7559      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7560                                               - generic     vmcnt(0)
7561
7562                                                            - If TgSplit execution mode,
7563                                                              omit lgkmcnt(0).
7564                                                            - If OpenCL, omit
7565                                                              lgkmcnt(0).
7566                                                            - Could be split into
7567                                                              separate s_waitcnt
7568                                                              vmcnt(0) and
7569                                                              s_waitcnt
7570                                                              lgkmcnt(0) to allow
7571                                                              them to be
7572                                                              independently moved
7573                                                              according to the
7574                                                              following rules.
7575                                                            - s_waitcnt vmcnt(0)
7576                                                              must happen after
7577                                                              any preceding
7578                                                              global/generic
7579                                                              load/store/load
7580                                                              atomic/store
7581                                                              atomic/atomicrmw.
7582                                                            - s_waitcnt lgkmcnt(0)
7583                                                              must happen after
7584                                                              any preceding
7585                                                              local/generic
7586                                                              load/store/load
7587                                                              atomic/store
7588                                                              atomic/atomicrmw.
7589                                                            - Must happen before
7590                                                              the following
7591                                                              atomicrmw.
7592                                                            - Ensures that all
7593                                                              memory operations
7594                                                              to global and local
7595                                                              have completed
7596                                                              before performing
7597                                                              the atomicrmw that
7598                                                              is being released.
7599
7600                                                          2. buffer/global/flat_atomic
7601      atomicrmw    release      - system       - global   1. buffer_wbl2
7602                                               - generic
7603                                                            - Must happen before
7604                                                              following s_waitcnt.
7605                                                            - Performs L2 writeback to
7606                                                              ensure previous
7607                                                              global/generic
7608                                                              store/atomicrmw are
7609                                                              visible at system scope.
7610
7611                                                          2. s_waitcnt lgkmcnt(0) &
7612                                                             vmcnt(0)
7613
7614                                                            - If TgSplit execution mode,
7615                                                              omit lgkmcnt(0).
7616                                                            - If OpenCL, omit
7617                                                              lgkmcnt(0).
7618                                                            - Could be split into
7619                                                              separate s_waitcnt
7620                                                              vmcnt(0) and
7621                                                              s_waitcnt
7622                                                              lgkmcnt(0) to allow
7623                                                              them to be
7624                                                              independently moved
7625                                                              according to the
7626                                                              following rules.
7627                                                            - s_waitcnt vmcnt(0)
7628                                                              must happen after
7629                                                              any preceding
7630                                                              global/generic
7631                                                              load/store/load
7632                                                              atomic/store
7633                                                              atomic/atomicrmw.
7634                                                            - s_waitcnt lgkmcnt(0)
7635                                                              must happen after
7636                                                              any preceding
7637                                                              local/generic
7638                                                              load/store/load
7639                                                              atomic/store
7640                                                              atomic/atomicrmw.
7641                                                            - Must happen before
7642                                                              the following
7643                                                              atomicrmw.
7644                                                            - Ensures that all
7645                                                              memory operations
7646                                                              to memory and the L2
7647                                                              writeback have
7648                                                              completed before
7649                                                              performing the
7650                                                              atomicrmw that is
7651                                                              being released.
7652
7653                                                          3. buffer/global/flat_atomic
7654      fence        release      - singlethread *none*     *none*
7655                                - wavefront
7656      fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7657
7658                                                            - Use lgkmcnt(0) if not
7659                                                              TgSplit execution mode
7660                                                              and vmcnt(0) if TgSplit
7661                                                              execution mode.
7662                                                            - If OpenCL and
7663                                                              address space is
7664                                                              not generic, omit
7665                                                              lgkmcnt(0).
7666                                                            - If OpenCL and
7667                                                              address space is
7668                                                              local, omit
7669                                                              vmcnt(0).
7670                                                            - However, since LLVM
7671                                                              currently has no
7672                                                              address space on
7673                                                              the fence, it is necessary
7674                                                              to conservatively
7675                                                              always generate. If
7676                                                              fence had an
7677                                                              address space then
7678                                                              set to address
7679                                                              space of OpenCL
7680                                                              fence flag, or to
7681                                                              generic if both
7682                                                              local and global
7683                                                              flags are
7684                                                              specified.
7685                                                            - s_waitcnt vmcnt(0)
7686                                                              must happen after
7687                                                              any preceding
7688                                                              global/generic
7689                                                              load/store/
7690                                                              load atomic/store atomic/
7691                                                              atomicrmw.
7692                                                            - s_waitcnt lgkmcnt(0)
7693                                                              must happen after
7694                                                              any preceding
7695                                                              local/generic
7696                                                              load/load
7697                                                              atomic/store/store
7698                                                              atomic/atomicrmw.
7699                                                            - Must happen before
7700                                                              any following store
7701                                                              atomic/atomicrmw
7702                                                              with an equal or
7703                                                              wider sync scope
7704                                                              and memory ordering
7705                                                              stronger than
7706                                                              unordered (this is
7707                                                              termed the
7708                                                              fence-paired-atomic).
7709                                                            - Ensures that all
7710                                                              memory operations
7711                                                              have
7712                                                              completed before
7713                                                              performing the
7714                                                              following
7715                                                              fence-paired-atomic.
7716
7717      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7718                                                             vmcnt(0)
7719
7720                                                            - If TgSplit execution mode,
7721                                                              omit lgkmcnt(0).
7722                                                            - If OpenCL and
7723                                                              address space is
7724                                                              not generic, omit
7725                                                              lgkmcnt(0).
7726                                                            - If OpenCL and
7727                                                              address space is
7728                                                              local, omit
7729                                                              vmcnt(0).
7730                                                            - However, since LLVM
7731                                                              currently has no
7732                                                              address space on
7733                                                              the fence, it is necessary
7734                                                              to conservatively
7735                                                              always generate. If
7736                                                              fence had an
7737                                                              address space then
7738                                                              set to address
7739                                                              space of OpenCL
7740                                                              fence flag, or to
7741                                                              generic if both
7742                                                              local and global
7743                                                              flags are
7744                                                              specified.
7745                                                            - Could be split into
7746                                                              separate s_waitcnt
7747                                                              vmcnt(0) and
7748                                                              s_waitcnt
7749                                                              lgkmcnt(0) to allow
7750                                                              them to be
7751                                                              independently moved
7752                                                              according to the
7753                                                              following rules.
7754                                                            - s_waitcnt vmcnt(0)
7755                                                              must happen after
7756                                                              any preceding
7757                                                              global/generic
7758                                                              load/store/load
7759                                                              atomic/store
7760                                                              atomic/atomicrmw.
7761                                                            - s_waitcnt lgkmcnt(0)
7762                                                              must happen after
7763                                                              any preceding
7764                                                              local/generic
7765                                                              load/store/load
7766                                                              atomic/store
7767                                                              atomic/atomicrmw.
7768                                                            - Must happen before
7769                                                              any following store
7770                                                              atomic/atomicrmw
7771                                                              with an equal or
7772                                                              wider sync scope
7773                                                              and memory ordering
7774                                                              stronger than
7775                                                              unordered (this is
7776                                                              termed the
7777                                                              fence-paired-atomic).
7778                                                            - Ensures that all
7779                                                              memory operations
7780                                                              have
7781                                                              completed before
7782                                                              performing the
7783                                                              following
7784                                                              fence-paired-atomic.
7785
7786      fence        release      - system       *none*     1. buffer_wbl2
7787
7788                                                            - If OpenCL and
7789                                                              address space is
7790                                                              local, omit.
7791                                                            - Must happen before
7792                                                              following s_waitcnt.
7793                                                            - Performs L2 writeback to
7794                                                              ensure previous
7795                                                              global/generic
7796                                                              store/atomicrmw are
7797                                                              visible at system scope.
7798
7799                                                          2. s_waitcnt lgkmcnt(0) &
7800                                                             vmcnt(0)
7801
7802                                                            - If TgSplit execution mode,
7803                                                              omit lgkmcnt(0).
7804                                                            - If OpenCL and
7805                                                              address space is
7806                                                              not generic, omit
7807                                                              lgkmcnt(0).
7808                                                            - If OpenCL and
7809                                                              address space is
7810                                                              local, omit
7811                                                              vmcnt(0).
7812                                                            - However, since LLVM
7813                                                              currently has no
7814                                                              address space on
7815                                                              the fence, the wait must
7816                                                              conservatively
7817                                                              always be generated. If
7818                                                              the fence had an
7819                                                              address space, it would be
7820                                                              set to the address
7821                                                              space of the OpenCL
7822                                                              fence flag, or to
7823                                                              generic if both
7824                                                              local and global
7825                                                              flags are
7826                                                              specified.
7827                                                            - Could be split into
7828                                                              separate s_waitcnt
7829                                                              vmcnt(0) and
7830                                                              s_waitcnt
7831                                                              lgkmcnt(0) to allow
7832                                                              them to be
7833                                                              independently moved
7834                                                              according to the
7835                                                              following rules.
7836                                                            - s_waitcnt vmcnt(0)
7837                                                              must happen after
7838                                                              any preceding
7839                                                              global/generic
7840                                                              load/store/load
7841                                                              atomic/store
7842                                                              atomic/atomicrmw.
7843                                                            - s_waitcnt lgkmcnt(0)
7844                                                              must happen after
7845                                                              any preceding
7846                                                              local/generic
7847                                                              load/store/load
7848                                                              atomic/store
7849                                                              atomic/atomicrmw.
7850                                                            - Must happen before
7851                                                              any following store
7852                                                              atomic/atomicrmw
7853                                                              with an equal or
7854                                                              wider sync scope
7855                                                              and memory ordering
7856                                                              stronger than
7857                                                              unordered (this is
7858                                                              termed the
7859                                                              fence-paired-atomic).
7860                                                            - Ensures that all
7861                                                              memory operations
7862                                                              have
7863                                                              completed before
7864                                                              performing the
7865                                                              following
7866                                                              fence-paired-atomic.
7867
7868      **Acquire-Release Atomic**
7869      ------------------------------------------------------------------------------------
7870      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7871                                - wavefront    - generic
7872      atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7873                                - wavefront               local address space cannot
7874                                                          be used.*
7875
7876                                                          1. ds_atomic
7877      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7878
7879                                                            - Use lgkmcnt(0) if not
7880                                                              TgSplit execution mode
7881                                                              and vmcnt(0) if TgSplit
7882                                                              execution mode.
7883                                                            - If OpenCL, omit
7884                                                              lgkmcnt(0).
7885                                                            - Must happen after
7886                                                              any preceding
7887                                                              local/generic
7888                                                              load/store/load
7889                                                              atomic/store
7890                                                              atomic/atomicrmw.
7891                                                            - s_waitcnt vmcnt(0)
7892                                                              must happen after
7893                                                              any preceding
7894                                                              global/generic load/store/
7895                                                              load atomic/store atomic/
7896                                                              atomicrmw.
7897                                                            - s_waitcnt lgkmcnt(0)
7898                                                              must happen after
7899                                                              any preceding
7900                                                              local/generic
7901                                                              load/store/load
7902                                                              atomic/store
7903                                                              atomic/atomicrmw.
7904                                                            - Must happen before
7905                                                              the following
7906                                                              atomicrmw.
7907                                                            - Ensures that all
7908                                                              memory operations
7909                                                              have
7910                                                              completed before
7911                                                              performing the
7912                                                              atomicrmw that is
7913                                                              being released.
7914
7915                                                          2. buffer/global_atomic
7916                                                          3. s_waitcnt vmcnt(0)
7917
7918                                                            - If not TgSplit execution
7919                                                              mode, omit.
7920                                                            - Must happen before
7921                                                              the following
7922                                                              buffer_wbinvl1_vol.
7923                                                            - Ensures any
7924                                                              following global
7925                                                              data read is no
7926                                                              older than the
7927                                                              atomicrmw value
7928                                                              being acquired.
7929
7930                                                          4. buffer_wbinvl1_vol
7931
7932                                                            - If not TgSplit execution
7933                                                              mode, omit.
7934                                                            - Ensures that
7935                                                              following
7936                                                              loads will not see
7937                                                              stale data.
7938
7939      atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
7940                                                          local address space cannot
7941                                                          be used.*
7942
7943                                                          1. ds_atomic
7944                                                          2. s_waitcnt lgkmcnt(0)
7945
7946                                                            - If OpenCL, omit.
7947                                                            - Must happen before
7948                                                              any following
7949                                                              global/generic
7950                                                              load/load
7951                                                              atomic/store/store
7952                                                              atomic/atomicrmw.
7953                                                            - Ensures any
7954                                                              following global
7955                                                              data read is no
7956                                                              older than the local
7957                                                              atomicrmw value being
7958                                                              acquired.
7959
7960      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
7961
7962                                                            - Use lgkmcnt(0) if not
7963                                                              TgSplit execution mode
7964                                                              and vmcnt(0) if TgSplit
7965                                                              execution mode.
7966                                                            - If OpenCL, omit
7967                                                              lgkmcnt(0).
7968                                                            - s_waitcnt vmcnt(0)
7969                                                              must happen after
7970                                                              any preceding
7971                                                              global/generic load/store/
7972                                                              load atomic/store atomic/
7973                                                              atomicrmw.
7974                                                            - s_waitcnt lgkmcnt(0)
7975                                                              must happen after
7976                                                              any preceding
7977                                                              local/generic
7978                                                              load/store/load
7979                                                              atomic/store
7980                                                              atomic/atomicrmw.
7981                                                            - Must happen before
7982                                                              the following
7983                                                              atomicrmw.
7984                                                            - Ensures that all
7985                                                              memory operations
7986                                                              have
7987                                                              completed before
7988                                                              performing the
7989                                                              atomicrmw that is
7990                                                              being released.
7991
7992                                                          2. flat_atomic
7993                                                          3. s_waitcnt lgkmcnt(0) &
7994                                                             vmcnt(0)
7995
7996                                                            - If not TgSplit execution
7997                                                              mode, omit vmcnt(0).
7998                                                            - If OpenCL, omit
7999                                                              lgkmcnt(0).
8000                                                            - Must happen before
8001                                                              the following
8002                                                              buffer_wbinvl1_vol and
8003                                                              any following
8004                                                              global/generic
8005                                                              load/load
8006                                                              atomic/store/store
8007                                                              atomic/atomicrmw.
8008                                                            - Ensures any
8009                                                              following global
8010                                                              data read is no
8011                                                              older than a local
8012                                                              atomicrmw value being
8013                                                              acquired.
8014
8015                                                          4. buffer_wbinvl1_vol
8016
8017                                                            - If not TgSplit execution
8018                                                              mode, omit.
8019                                                            - Ensures that
8020                                                              following
8021                                                              loads will not see
8022                                                              stale data.
8023
8024      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8025                                                             vmcnt(0)
8026
8027                                                            - If TgSplit execution mode,
8028                                                              omit lgkmcnt(0).
8029                                                            - If OpenCL, omit
8030                                                              lgkmcnt(0).
8031                                                            - Could be split into
8032                                                              separate s_waitcnt
8033                                                              vmcnt(0) and
8034                                                              s_waitcnt
8035                                                              lgkmcnt(0) to allow
8036                                                              them to be
8037                                                              independently moved
8038                                                              according to the
8039                                                              following rules.
8040                                                            - s_waitcnt vmcnt(0)
8041                                                              must happen after
8042                                                              any preceding
8043                                                              global/generic
8044                                                              load/store/load
8045                                                              atomic/store
8046                                                              atomic/atomicrmw.
8047                                                            - s_waitcnt lgkmcnt(0)
8048                                                              must happen after
8049                                                              any preceding
8050                                                              local/generic
8051                                                              load/store/load
8052                                                              atomic/store
8053                                                              atomic/atomicrmw.
8054                                                            - Must happen before
8055                                                              the following
8056                                                              atomicrmw.
8057                                                            - Ensures that all
8058                                                              memory operations
8059                                                              to global have
8060                                                              completed before
8061                                                              performing the
8062                                                              atomicrmw that is
8063                                                              being released.
8064
8065                                                          2. buffer/global_atomic
8066                                                          3. s_waitcnt vmcnt(0)
8067
8068                                                            - Must happen before
8069                                                              following
8070                                                              buffer_wbinvl1_vol.
8071                                                            - Ensures the
8072                                                              atomicrmw has
8073                                                              completed before
8074                                                              invalidating the
8075                                                              cache.
8076
8077                                                          4. buffer_wbinvl1_vol
8078
8079                                                            - Must happen before
8080                                                              any following
8081                                                              global/generic
8082                                                              load/load
8083                                                              atomic/atomicrmw.
8084                                                            - Ensures that
8085                                                              following loads
8086                                                              will not see stale
8087                                                              global data.
8088
8089      atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
8090
8091                                                            - Must happen before
8092                                                              following s_waitcnt.
8093                                                            - Performs L2 writeback to
8094                                                              ensure previous
8095                                                              global/generic
8096                                                              store/atomicrmw are
8097                                                              visible at system scope.
8098
8099                                                          2. s_waitcnt lgkmcnt(0) &
8100                                                             vmcnt(0)
8101
8102                                                            - If TgSplit execution mode,
8103                                                              omit lgkmcnt(0).
8104                                                            - If OpenCL, omit
8105                                                              lgkmcnt(0).
8106                                                            - Could be split into
8107                                                              separate s_waitcnt
8108                                                              vmcnt(0) and
8109                                                              s_waitcnt
8110                                                              lgkmcnt(0) to allow
8111                                                              them to be
8112                                                              independently moved
8113                                                              according to the
8114                                                              following rules.
8115                                                            - s_waitcnt vmcnt(0)
8116                                                              must happen after
8117                                                              any preceding
8118                                                              global/generic
8119                                                              load/store/load
8120                                                              atomic/store
8121                                                              atomic/atomicrmw.
8122                                                            - s_waitcnt lgkmcnt(0)
8123                                                              must happen after
8124                                                              any preceding
8125                                                              local/generic
8126                                                              load/store/load
8127                                                              atomic/store
8128                                                              atomic/atomicrmw.
8129                                                            - Must happen before
8130                                                              the following
8131                                                              atomicrmw.
8132                                                            - Ensures that all
8133                                                              memory operations
8134                                                              to global and L2 writeback
8135                                                              have completed before
8136                                                              performing the
8137                                                              atomicrmw that is
8138                                                              being released.
8139
8140                                                          3. buffer/global_atomic
8141                                                          4. s_waitcnt vmcnt(0)
8142
8143                                                            - Must happen before
8144                                                              following buffer_invl2 and
8145                                                              buffer_wbinvl1_vol.
8146                                                            - Ensures the
8147                                                              atomicrmw has
8148                                                              completed before
8149                                                              invalidating the
8150                                                              caches.
8151
8152                                                          5. buffer_invl2;
8153                                                             buffer_wbinvl1_vol
8154
8155                                                            - Must happen before
8156                                                              any following
8157                                                              global/generic
8158                                                              load/load
8159                                                              atomic/atomicrmw.
8160                                                            - Ensures that
8161                                                              following
8162                                                              loads will not see
8163                                                              stale L1 global data,
8164                                                              nor see stale L2 MTYPE
8165                                                              NC global data.
8166                                                              MTYPE RW and CC memory will
8167                                                              never be stale in L2 due to
8168                                                              the memory probes.
8169
8170      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
8171                                                             vmcnt(0)
8172
8173                                                            - If TgSplit execution mode,
8174                                                              omit lgkmcnt(0).
8175                                                            - If OpenCL, omit
8176                                                              lgkmcnt(0).
8177                                                            - Could be split into
8178                                                              separate s_waitcnt
8179                                                              vmcnt(0) and
8180                                                              s_waitcnt
8181                                                              lgkmcnt(0) to allow
8182                                                              them to be
8183                                                              independently moved
8184                                                              according to the
8185                                                              following rules.
8186                                                            - s_waitcnt vmcnt(0)
8187                                                              must happen after
8188                                                              any preceding
8189                                                              global/generic
8190                                                              load/store/load
8191                                                              atomic/store
8192                                                              atomic/atomicrmw.
8193                                                            - s_waitcnt lgkmcnt(0)
8194                                                              must happen after
8195                                                              any preceding
8196                                                              local/generic
8197                                                              load/store/load
8198                                                              atomic/store
8199                                                              atomic/atomicrmw.
8200                                                            - Must happen before
8201                                                              the following
8202                                                              atomicrmw.
8203                                                            - Ensures that all
8204                                                              memory operations
8205                                                              to global have
8206                                                              completed before
8207                                                              performing the
8208                                                              atomicrmw that is
8209                                                              being released.
8210
8211                                                          2. flat_atomic
8212                                                          3. s_waitcnt vmcnt(0) &
8213                                                             lgkmcnt(0)
8214
8215                                                            - If TgSplit execution mode,
8216                                                              omit lgkmcnt(0).
8217                                                            - If OpenCL, omit
8218                                                              lgkmcnt(0).
8219                                                            - Must happen before
8220                                                              following
8221                                                              buffer_wbinvl1_vol.
8222                                                            - Ensures the
8223                                                              atomicrmw has
8224                                                              completed before
8225                                                              invalidating the
8226                                                              cache.
8227
8228                                                          4. buffer_wbinvl1_vol
8229
8230                                                            - Must happen before
8231                                                              any following
8232                                                              global/generic
8233                                                              load/load
8234                                                              atomic/atomicrmw.
8235                                                            - Ensures that
8236                                                              following loads
8237                                                              will not see stale
8238                                                              global data.
8239
8240      atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
8241
8242                                                            - Must happen before
8243                                                              following s_waitcnt.
8244                                                            - Performs L2 writeback to
8245                                                              ensure previous
8246                                                              global/generic
8247                                                              store/atomicrmw are
8248                                                              visible at system scope.
8249
8250                                                          2. s_waitcnt lgkmcnt(0) &
8251                                                             vmcnt(0)
8252
8253                                                            - If TgSplit execution mode,
8254                                                              omit lgkmcnt(0).
8255                                                            - If OpenCL, omit
8256                                                              lgkmcnt(0).
8257                                                            - Could be split into
8258                                                              separate s_waitcnt
8259                                                              vmcnt(0) and
8260                                                              s_waitcnt
8261                                                              lgkmcnt(0) to allow
8262                                                              them to be
8263                                                              independently moved
8264                                                              according to the
8265                                                              following rules.
8266                                                            - s_waitcnt vmcnt(0)
8267                                                              must happen after
8268                                                              any preceding
8269                                                              global/generic
8270                                                              load/store/load
8271                                                              atomic/store
8272                                                              atomic/atomicrmw.
8273                                                            - s_waitcnt lgkmcnt(0)
8274                                                              must happen after
8275                                                              any preceding
8276                                                              local/generic
8277                                                              load/store/load
8278                                                              atomic/store
8279                                                              atomic/atomicrmw.
8280                                                            - Must happen before
8281                                                              the following
8282                                                              atomicrmw.
8283                                                            - Ensures that all
8284                                                              memory operations
8285                                                              to global and L2 writeback
8286                                                              have completed before
8287                                                              performing the
8288                                                              atomicrmw that is
8289                                                              being released.
8290
8291                                                          3. flat_atomic
8292                                                          4. s_waitcnt vmcnt(0) &
8293                                                             lgkmcnt(0)
8294
8295                                                            - If TgSplit execution mode,
8296                                                              omit lgkmcnt(0).
8297                                                            - If OpenCL, omit
8298                                                              lgkmcnt(0).
8299                                                            - Must happen before
8300                                                              following buffer_invl2 and
8301                                                              buffer_wbinvl1_vol.
8302                                                            - Ensures the
8303                                                              atomicrmw has
8304                                                              completed before
8305                                                              invalidating the
8306                                                              caches.
8307
8308                                                          5. buffer_invl2;
8309                                                             buffer_wbinvl1_vol
8310
8311                                                            - Must happen before
8312                                                              any following
8313                                                              global/generic
8314                                                              load/load
8315                                                              atomic/atomicrmw.
8316                                                            - Ensures that
8317                                                              following
8318                                                              loads will not see
8319                                                              stale L1 global data,
8320                                                              nor see stale L2 MTYPE
8321                                                              NC global data.
8322                                                              MTYPE RW and CC memory will
8323                                                              never be stale in L2 due to
8324                                                              the memory probes.
8325
8326      fence        acq_rel      - singlethread *none*     *none*
8327                                - wavefront
8328      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
8329
8330                                                            - Use lgkmcnt(0) if not
8331                                                              TgSplit execution mode
8332                                                              and vmcnt(0) if TgSplit
8333                                                              execution mode.
8334                                                            - If OpenCL and
8335                                                              address space is
8336                                                              not generic, omit
8337                                                              lgkmcnt(0).
8338                                                            - If OpenCL and
8339                                                              address space is
8340                                                              local, omit
8341                                                              vmcnt(0).
8342                                                            - However,
8343                                                              since LLVM
8344                                                              currently has no
8345                                                              address space on
8346                                                              the fence, it must be
8347                                                              conservatively
8348                                                              always generated
8349                                                              (see comment for
8350                                                              previous fence).
8351                                                            - s_waitcnt vmcnt(0)
8352                                                              must happen after
8353                                                              any preceding
8354                                                              global/generic
8355                                                              load/store/
8356                                                              load atomic/store atomic/
8357                                                              atomicrmw.
8358                                                            - s_waitcnt lgkmcnt(0)
8359                                                              must happen after
8360                                                              any preceding
8361                                                              local/generic
8362                                                              load/load
8363                                                              atomic/store/store
8364                                                              atomic/atomicrmw.
8365                                                            - Must happen before
8366                                                              any following
8367                                                              global/generic
8368                                                              load/load
8369                                                              atomic/store/store
8370                                                              atomic/atomicrmw.
8371                                                            - Ensures that all
8372                                                              memory operations
8373                                                              have
8374                                                              completed before
8375                                                              performing any
8376                                                              following global
8377                                                              memory operations.
8378                                                            - Ensures that the
8379                                                              preceding
8380                                                              local/generic load
8381                                                              atomic/atomicrmw
8382                                                              with an equal or
8383                                                              wider sync scope
8384                                                              and memory ordering
8385                                                              stronger than
8386                                                              unordered (this is
8387                                                              termed the
8388                                                              acquire-fence-paired-atomic)
8389                                                              has completed
8390                                                              before following
8391                                                              global memory
8392                                                              operations. This
8393                                                              satisfies the
8394                                                              requirements of
8395                                                              acquire.
8396                                                            - Ensures that all
8397                                                              previous memory
8398                                                              operations have
8399                                                              completed before a
8400                                                              following
8401                                                              local/generic store
8402                                                              atomic/atomicrmw
8403                                                              with an equal or
8404                                                              wider sync scope
8405                                                              and memory ordering
8406                                                              stronger than
8407                                                              unordered (this is
8408                                                              termed the
8409                                                              release-fence-paired-atomic).
8410                                                              This satisfies the
8411                                                              requirements of
8412                                                              release.
8413                                                            - Must happen before
8414                                                              the following
8415                                                              buffer_wbinvl1_vol.
8416                                                            - Ensures that the
8417                                                              acquire-fence-paired-atomic
8418                                                              has completed
8419                                                              before invalidating
8420                                                              the
8421                                                              cache. Therefore
8422                                                              any following
8423                                                              locations read must
8424                                                              be no older than
8425                                                              the value read by
8426                                                              the
8427                                                              acquire-fence-paired-atomic.
8428
8429                                                          2. buffer_wbinvl1_vol
8430
8431                                                            - If not TgSplit execution
8432                                                              mode, omit.
8433                                                            - Ensures that
8434                                                              following
8435                                                              loads will not see
8436                                                              stale data.
8437
8438      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8439                                                             vmcnt(0)
8440
8441                                                            - If TgSplit execution mode,
8442                                                              omit lgkmcnt(0).
8443                                                            - If OpenCL and
8444                                                              address space is
8445                                                              not generic, omit
8446                                                              lgkmcnt(0).
8447                                                            - However, since LLVM
8448                                                              currently has no
8449                                                              address space on
8450                                                              the fence, it must be
8451                                                              conservatively
8452                                                              always generated
8453                                                              (see comment for
8454                                                              previous fence).
8455                                                            - Could be split into
8456                                                              separate s_waitcnt
8457                                                              vmcnt(0) and
8458                                                              s_waitcnt
8459                                                              lgkmcnt(0) to allow
8460                                                              them to be
8461                                                              independently moved
8462                                                              according to the
8463                                                              following rules.
8464                                                            - s_waitcnt vmcnt(0)
8465                                                              must happen after
8466                                                              any preceding
8467                                                              global/generic
8468                                                              load/store/load
8469                                                              atomic/store
8470                                                              atomic/atomicrmw.
8471                                                            - s_waitcnt lgkmcnt(0)
8472                                                              must happen after
8473                                                              any preceding
8474                                                              local/generic
8475                                                              load/store/load
8476                                                              atomic/store
8477                                                              atomic/atomicrmw.
8478                                                            - Must happen before
8479                                                              the following
8480                                                              buffer_wbinvl1_vol.
8481                                                            - Ensures that the
8482                                                              preceding
8483                                                              global/local/generic
8484                                                              load
8485                                                              atomic/atomicrmw
8486                                                              with an equal or
8487                                                              wider sync scope
8488                                                              and memory ordering
8489                                                              stronger than
8490                                                              unordered (this is
8491                                                              termed the
8492                                                              acquire-fence-paired-atomic)
8493                                                              has completed
8494                                                              before invalidating
8495                                                              the cache. This
8496                                                              satisfies the
8497                                                              requirements of
8498                                                              acquire.
8499                                                            - Ensures that all
8500                                                              previous memory
8501                                                              operations have
8502                                                              completed before a
8503                                                              following
8504                                                              global/local/generic
8505                                                              store
8506                                                              atomic/atomicrmw
8507                                                              with an equal or
8508                                                              wider sync scope
8509                                                              and memory ordering
8510                                                              stronger than
8511                                                              unordered (this is
8512                                                              termed the
8513                                                              release-fence-paired-atomic).
8514                                                              This satisfies the
8515                                                              requirements of
8516                                                              release.
8517
8518                                                          2. buffer_wbinvl1_vol
8519
8520                                                            - Must happen before
8521                                                              any following
8522                                                              global/generic
8523                                                              load/load
8524                                                              atomic/store/store
8525                                                              atomic/atomicrmw.
8526                                                            - Ensures that
8527                                                              following loads
8528                                                              will not see stale
8529                                                              global data. This
8530                                                              satisfies the
8531                                                              requirements of
8532                                                              acquire.
8533
8534      fence        acq_rel      - system       *none*     1. buffer_wbl2
8535
8536                                                            - If OpenCL and
8537                                                              address space is
8538                                                              local, omit.
8539                                                            - Must happen before
8540                                                              following s_waitcnt.
8541                                                            - Performs L2 writeback to
8542                                                              ensure previous
8543                                                              global/generic
8544                                                              store/atomicrmw are
8545                                                              visible at system scope.
8546
8547                                                          2. s_waitcnt lgkmcnt(0) &
8548                                                             vmcnt(0)
8549
8550                                                            - If TgSplit execution mode,
8551                                                              omit lgkmcnt(0).
8552                                                            - If OpenCL and
8553                                                              address space is
8554                                                              not generic, omit
8555                                                              lgkmcnt(0).
8556                                                            - However, since LLVM
8557                                                              currently has no
8558                                                              address space on
8559                                                              the fence, it must be
8560                                                              conservatively
8561                                                              always generated
8562                                                              (see comment for
8563                                                              previous fence).
8564                                                            - Could be split into
8565                                                              separate s_waitcnt
8566                                                              vmcnt(0) and
8567                                                              s_waitcnt
8568                                                              lgkmcnt(0) to allow
8569                                                              them to be
8570                                                              independently moved
8571                                                              according to the
8572                                                              following rules.
8573                                                            - s_waitcnt vmcnt(0)
8574                                                              must happen after
8575                                                              any preceding
8576                                                              global/generic
8577                                                              load/store/load
8578                                                              atomic/store
8579                                                              atomic/atomicrmw.
8580                                                            - s_waitcnt lgkmcnt(0)
8581                                                              must happen after
8582                                                              any preceding
8583                                                              local/generic
8584                                                              load/store/load
8585                                                              atomic/store
8586                                                              atomic/atomicrmw.
8587                                                            - Must happen before
8588                                                              the following buffer_invl2 and
8589                                                              buffer_wbinvl1_vol.
8590                                                            - Ensures that the
8591                                                              preceding
8592                                                              global/local/generic
8593                                                              load
8594                                                              atomic/atomicrmw
8595                                                              with an equal or
8596                                                              wider sync scope
8597                                                              and memory ordering
8598                                                              stronger than
8599                                                              unordered (this is
8600                                                              termed the
8601                                                              acquire-fence-paired-atomic)
8602                                                              has completed
8603                                                              before invalidating
8604                                                              the cache. This
8605                                                              satisfies the
8606                                                              requirements of
8607                                                              acquire.
8608                                                            - Ensures that all
8609                                                              previous memory
8610                                                              operations have
8611                                                              completed before a
8612                                                              following
8613                                                              global/local/generic
8614                                                              store
8615                                                              atomic/atomicrmw
8616                                                              with an equal or
8617                                                              wider sync scope
8618                                                              and memory ordering
8619                                                              stronger than
8620                                                              unordered (this is
8621                                                              termed the
8622                                                              release-fence-paired-atomic).
8623                                                              This satisfies the
8624                                                              requirements of
8625                                                              release.
8626
8627                                                          3.  buffer_invl2;
8628                                                              buffer_wbinvl1_vol
8629
8630                                                            - Must happen before
8631                                                              any following
8632                                                              global/generic
8633                                                              load/load
8634                                                              atomic/store/store
8635                                                              atomic/atomicrmw.
8636                                                            - Ensures that
8637                                                              following
8638                                                              loads will not see
8639                                                              stale L1 global data,
8640                                                              nor see stale L2 MTYPE
8641                                                              NC global data.
8642                                                              MTYPE RW and CC memory will
8643                                                              never be stale in L2 due to
8644                                                              the memory probes.
8645
8646      **Sequential Consistent Atomic**
8647      ------------------------------------------------------------------------------------
8648      load atomic  seq_cst      - singlethread - global   *Same as corresponding
8649                                - wavefront    - local    load atomic acquire,
8650                                               - generic  except must generate
8651                                                          all instructions even
8652                                                          for OpenCL.*
8653      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8654                                               - generic
8655                                                            - Use lgkmcnt(0) if not
8656                                                              TgSplit execution mode
8657                                                              and vmcnt(0) if TgSplit
8658                                                              execution mode.
8659                                                            - s_waitcnt lgkmcnt(0) must
8660                                                              happen after
8661                                                              preceding
8662                                                              local/generic load
8663                                                              atomic/store
8664                                                              atomic/atomicrmw
8665                                                              with memory
8666                                                              ordering of seq_cst
8667                                                              and with equal or
8668                                                              wider sync scope.
8669                                                              (Note that seq_cst
8670                                                              fences have their
8671                                                              own s_waitcnt
8672                                                              lgkmcnt(0) and so do
8673                                                              not need to be
8674                                                              considered.)
8675                                                            - s_waitcnt vmcnt(0)
8676                                                              must happen after
8677                                                              preceding
8678                                                              global/generic load
8679                                                              atomic/store
8680                                                              atomic/atomicrmw
8681                                                              with memory
8682                                                              ordering of seq_cst
8683                                                              and with equal or
8684                                                              wider sync scope.
8685                                                              (Note that seq_cst
8686                                                              fences have their
8687                                                              own s_waitcnt
8688                                                              vmcnt(0) and so do
8689                                                              not need to be
8690                                                              considered.)
8691                                                            - Ensures any
8692                                                              preceding
8693                                                              sequentially
8694                                                              consistent global/local
8695                                                              memory instructions
8696                                                              have completed
8697                                                              before executing
8698                                                              this sequentially
8699                                                              consistent
8700                                                              instruction. This
8701                                                              prevents reordering
8702                                                              a seq_cst store
8703                                                              followed by a
8704                                                              seq_cst load. (Note
8705                                                              that seq_cst is
8706                                                              stronger than
8707                                                              acquire/release as
8708                                                              the reordering of
8709                                                              load acquire
8710                                                              followed by a store
8711                                                              release is
8712                                                              prevented by the
8713                                                              s_waitcnt of
8714                                                              the release, but
8715                                                              there is nothing
8716                                                              preventing a store
8717                                                              release followed by
8718                                                              load acquire from
8719                                                              completing out of
8720                                                              order. The s_waitcnt
8721                                                              could be placed after the
8722                                                              seq_cst store or before
8723                                                              the seq_cst load. We
8724                                                              choose the load to
8725                                                              make the s_waitcnt be
8726                                                              as late as possible
8727                                                              so that the store
8728                                                              may have already
8729                                                              completed.)
8730
8731                                                          2. *Following
8732                                                             instructions same as
8733                                                             corresponding load
8734                                                             atomic acquire,
8735                                                             except must generate
8736                                                             all instructions even
8737                                                             for OpenCL.*
8738      load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8739                                                          local address space cannot
8740                                                          be used.*
8741
8742                                                          *Same as corresponding
8743                                                          load atomic acquire,
8744                                                          except must generate
8745                                                          all instructions even
8746                                                          for OpenCL.*
8747
8748      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8749                                - system       - generic     vmcnt(0)
8750
8751                                                            - If TgSplit execution mode,
8752                                                              omit lgkmcnt(0).
8753                                                            - Could be split into
8754                                                              separate s_waitcnt
8755                                                              vmcnt(0)
8756                                                              and s_waitcnt
8757                                                              lgkmcnt(0) to allow
8758                                                              them to be
8759                                                              independently moved
8760                                                              according to the
8761                                                              following rules.
8762                                                            - s_waitcnt lgkmcnt(0)
8763                                                              must happen after
8764                                                              preceding
8765                                                              local/generic load
8766                                                              atomic/store
8767                                                              atomic/atomicrmw
8768                                                              with memory
8769                                                              ordering of seq_cst
8770                                                              and with equal or
8771                                                              wider sync scope.
8772                                                              (Note that seq_cst
8773                                                              fences have their
8774                                                              own s_waitcnt
8775                                                              lgkmcnt(0) and so do
8776                                                              not need to be
8777                                                              considered.)
8778                                                            - s_waitcnt vmcnt(0)
8779                                                              must happen after
8780                                                              preceding
8781                                                              global/generic load
8782                                                              atomic/store
8783                                                              atomic/atomicrmw
8784                                                              with memory
8785                                                              ordering of seq_cst
8786                                                              and with equal or
8787                                                              wider sync scope.
8788                                                              (Note that seq_cst
8789                                                              fences have their
8790                                                              own s_waitcnt
8791                                                              vmcnt(0) and so do
8792                                                              not need to be
8793                                                              considered.)
8794                                                            - Ensures any
8795                                                              preceding
8796                                                              sequentially
8797                                                              consistent global
8798                                                              memory instructions
8799                                                              have completed
8800                                                              before executing
8801                                                              this sequentially
8802                                                              consistent
8803                                                              instruction. This
8804                                                              prevents reordering
8805                                                              a seq_cst store
8806                                                              followed by a
8807                                                              seq_cst load. (Note
8808                                                              that seq_cst is
8809                                                              stronger than
8810                                                              acquire/release as
8811                                                              the reordering of
8812                                                              load acquire
8813                                                              followed by a store
8814                                                              release is
8815                                                              prevented by the
8816                                                              s_waitcnt of
8817                                                              the release, but
8818                                                              there is nothing
8819                                                              preventing a store
8820                                                              release followed by
8821                                                              load acquire from
8822                                                              completing out of
8823                                                              order. The s_waitcnt
8824                                                              could be placed after the
8825                                                              seq_cst store or before
8826                                                              the seq_cst load. We
8827                                                              choose the load to
8828                                                              make the s_waitcnt be
8829                                                              as late as possible
8830                                                              so that the store
8831                                                              may have already
8832                                                              completed.)
8833
8834                                                          2. *Following
8835                                                             instructions same as
8836                                                             corresponding load
8837                                                             atomic acquire,
8838                                                             except must generate
8839                                                             all instructions even
8840                                                             for OpenCL.*
8841      store atomic seq_cst      - singlethread - global   *Same as corresponding
8842                                - wavefront    - local    store atomic release,
8843                                - workgroup    - generic  except must generate
8844                                - agent                   all instructions even
8845                                - system                  for OpenCL.*
8846      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8847                                - wavefront    - local    atomicrmw acq_rel,
8848                                - workgroup    - generic  except must generate
8849                                - agent                   all instructions even
8850                                - system                  for OpenCL.*
8851      fence        seq_cst      - singlethread *none*     *Same as corresponding
8852                                - wavefront               fence acq_rel,
8853                                - workgroup               except must generate
8854                                - agent                   all instructions even
8855                                - system                  for OpenCL.*
8856      ============ ============ ============== ========== ================================
8857
8858 .. _amdgpu-amdhsa-memory-model-gfx940:
8859
8860 Memory Model GFX940
8861 +++++++++++++++++++
8862
8863 For GFX940:
8864
8865 * Each agent has multiple shader arrays (SA).
8866 * Each SA has multiple compute units (CU).
8867 * Each CU has multiple SIMDs that execute wavefronts.
8868 * The wavefronts for a single work-group are executed in the same CU but may be
8869   executed by different SIMDs. The exception is tgsplit execution mode, in
8870   which the wavefronts may be executed by different SIMDs in different CUs.
8871 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
8872   executing on it. The exception is tgsplit execution mode, in which no LDS is
8873   allocated because wavefronts of the same work-group can be in different CUs.
8874 * All LDS operations of a CU are performed as wavefront wide operations in a
8875   global order and involve no caching. Completion is reported to a wavefront in
8876   execution order.
8877 * The LDS memory has multiple request queues shared by the SIMDs of a
8878   CU. Therefore, the LDS operations performed by different wavefronts of a
8879   work-group can be reordered relative to each other, which can result in
8880   reordering the visibility of vector memory operations with respect to LDS
8881   operations of other wavefronts in the same work-group. A ``s_waitcnt
8882   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8883   vector memory operations between wavefronts of a work-group, but not between
8884   operations performed by the same wavefront.
8885 * The vector memory operations are performed as wavefront wide operations and
8886   completion is reported to a wavefront in execution order. The exception is
8887   that ``flat_load/store/atomic`` instructions can report out of vector memory
8888   order if they access LDS memory, and out of LDS operation order if they access
8889   global memory.
8890 * The vector memory operations access a single vector L1 cache shared by all
8891   SIMDs of a CU. Therefore:
8892
8893   * No special action is required for coherence between the lanes of a single
8894     wavefront.
8895
8896   * No special action is required for coherence between wavefronts in the same
8897     work-group since they execute on the same CU. The exception is when in
8898     tgsplit execution mode as wavefronts of the same work-group can be in
8899     different CUs and so a ``buffer_inv sc0`` is required which will invalidate
8900     the L1 cache.
8901
8902   * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
8903     between wavefronts executing in different work-groups as they may be
8904     executing on different CUs.
8905
8906   * Atomic read-modify-write instructions implicitly bypass the L1 cache.
8907     Therefore, they do not use the sc0 bit for coherence and instead use it to
8908     indicate if the instruction returns the original value being updated. They
8909     do use sc1 to indicate system or agent scope coherence.
8910
8911 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
8912   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
8913   scalar operations are used in a restricted way so do not impact the memory
8914   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
8915 * The vector and scalar memory operations use an L2 cache.
8916
8917   * The gfx940 can be configured as a number of smaller agents with each having
8918     a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
8919     larger agents with groups of CUs on each agent each sharing separate L2
8920     caches.
8921   * The L2 cache has independent channels to service disjoint ranges of virtual
8922     addresses.
8923   * Each CU has a separate request queue per channel for its associated L2.
8924     Therefore, the vector and scalar memory operations performed by wavefronts
8925     executing with different L1 caches and the same L2 cache can be reordered
8926     relative to each other.
8927   * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
8928     vector memory operations of different CUs. It ensures a previous vector
8929     memory operation has completed before executing a subsequent vector memory
8930     or LDS operation and so can be used to meet the requirements of acquire and
8931     release.
8932   * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
8933     (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
8934     the PTE C-bit set for memory not local to the L2.
8935
8936     * Any local memory cache lines will be automatically invalidated by writes
8937       from CUs associated with other L2 caches, or writes from the CPU, due to
8938       the cache probe caused by the PTE C-bit.
8939     * XGMI accesses from the CPU to local memory may be cached on the CPU.
8940       Subsequent access from the GPU will automatically invalidate or writeback
8941       the CPU cache due to the L2 probe filter.
8942     * To ensure coherence of local memory writes of CUs with different L1 caches
8943       in the same agent a ``buffer_wbl2`` is required. It does nothing if the
8944       agent is configured to have a single L2, or will writeback dirty L2 cache
8945       lines if configured to have multiple L2 caches.
8946     * To ensure coherence of local memory writes of CUs in different agents a
8947       ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
8948     * To ensure coherence of local memory reads of CUs with different L1 caches
8949       in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
8950       agent is configured to have a single L2, or will invalidate non-local L2
8951       cache lines if configured to have multiple L2 caches.
8952     * To ensure coherence of local memory reads of CUs in different agents a
8953       ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
8954       lines if configured to have multiple L2 caches.
8955
8956   * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
8957     UC (uncached) which bypasses the L2.
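
As a rough illustration of where the coherence rules above come into play, the
following LLVM IR sketches a message-passing pattern between two work-groups of
the same agent. The function and value names are hypothetical, the opaque
pointer syntax assumes a recent LLVM, and the comments only restate the
requirements listed above. The instruction sequences actually emitted are those
given in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.

.. code-block:: llvm

  ; Hypothetical producer: writes the payload, then publishes a flag with
  ; release ordering at agent scope. Address space 1 is the global address space.
  define void @producer(ptr addrspace(1) %data, ptr addrspace(1) %flag) {
    store i32 42, ptr addrspace(1) %data, align 4
    ; Before the flag becomes visible, the payload store must have completed
    ; (s_waitcnt vmcnt(0)) and, when the agent has multiple L2 caches, dirty
    ; L2 lines must be written back (buffer_wbl2).
    store atomic i32 1, ptr addrspace(1) %flag syncscope("agent") release, align 4
    ret void
  }

  ; Hypothetical consumer: reads the flag with acquire ordering, then the
  ; payload. A real consumer would loop until the flag is observed as set.
  define i32 @consumer(ptr addrspace(1) %data, ptr addrspace(1) %flag) {
    %f = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
    ; After the flag load completes, stale cache lines must be invalidated
    ; (buffer_inv sc1 for CUs with different L1 caches in the same agent) so
    ; that the payload load below does not return old data.
    %v = load i32, ptr addrspace(1) %data, align 4
    ret i32 %v
  }

Widening the sync scope to ``system`` would correspond to the "different
agents" items in the list above rather than the "same agent" items.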
8958
8959 Scalar memory operations are only used to access memory that is proven to not
8960 change during the execution of the kernel dispatch. This includes constant
8961 address space and global address space for program scope ``const`` variables.
8962 Therefore, the kernel machine code does not have to maintain the scalar cache to
8963 ensure it is coherent with the vector caches. The scalar and vector caches are
8964 invalidated between kernel dispatches by CP since constant address space data
8965 may change between kernel dispatch executions. See
8966 :ref:`amdgpu-amdhsa-memory-spaces`.
8967
8968 The one exception is if scalar writes are used to spill SGPR registers. In this
8969 case the AMDGPU backend ensures the memory location used to spill is never
8970 accessed by vector memory operations at the same time. If scalar writes are used
8971 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8972 return since the locations may be used for vector memory instructions by a
8973 future wavefront that uses the same scratch area, or a function call that
8974 creates a frame at the same address, respectively. There is no need for a
8975 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8976
8977 For kernarg backing memory:
8978
8979 * CP invalidates the L1 cache at the start of each kernel dispatch.
8980 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
8981   memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
8982   cache. This also causes it to be treated as non-volatile, so it is not
8983   invalidated by ``*_vol``.
8984 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8985   so the L2 cache will be coherent with the CPU and other agents.
8986
8987 Scratch backing memory (which is used for the private address space) is accessed
8988 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
8989 only accessed by a single thread, and is always write-before-read, there is
8990 never a need to invalidate these entries from the L1 cache. Hence all cache
8991 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
8992
8993 The code sequences used to implement the memory model for GFX940 are defined
8994 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
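
For orientation, each row of the table is selected by properties of the LLVM IR
instruction: the instruction kind, its memory ordering and sync scope, the
address space and, for non-atomic accesses, the ``volatile`` and
``!nontemporal`` markers. The sketch below shows one global access for each of
the non-atomic cases. The function name is hypothetical, the opaque pointer
syntax assumes a recent LLVM, and the comments only point at the corresponding
table entries.

.. code-block:: llvm

  ; Hypothetical function; address space 1 is the global address space.
  define i32 @nonatomic_global_accesses(ptr addrspace(1) %p) {
    ; Plain load: the "!volatile & !nontemporal" case.
    %a = load i32, ptr addrspace(1) %p, align 4
    ; Nontemporal load: the "!volatile & nontemporal" case (nt=1).
    %b = load i32, ptr addrspace(1) %p, align 4, !nontemporal !0
    ; Volatile load: the "volatile" case (sc0=1 sc1=1, then s_waitcnt vmcnt(0)).
    %c = load volatile i32, ptr addrspace(1) %p, align 4
    %ab = add i32 %a, %b
    %abc = add i32 %ab, %c
    ; Volatile store: the "volatile" case of the store row.
    store volatile i32 %abc, ptr addrspace(1) %p, align 4
    ret i32 %abc
  }

  ; Metadata node required by the !nontemporal attachment.
  !0 = !{i32 1}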
8995
8996   .. table:: AMDHSA Memory Model Code Sequences GFX940
8997      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
8998
8999      ============ ============ ============== ========== ================================
9000      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
9001                   Ordering     Sync Scope     Address    GFX940
9002                                               Space
9003      ============ ============ ============== ========== ================================
9004      **Non-Atomic**
9005      ------------------------------------------------------------------------------------
9006      load         *none*       *none*         - global   - !volatile & !nontemporal
9007                                               - generic
9008                                               - private    1. buffer/global/flat_load
9009                                               - constant
9010                                                          - !volatile & nontemporal
9011
9012                                                            1. buffer/global/flat_load
9013                                                               nt=1
9014
9015                                                          - volatile
9016
9017                                                            1. buffer/global/flat_load
9018                                                               sc0=1 sc1=1
9019                                                            2. s_waitcnt vmcnt(0)
9020
9021                                                             - Must happen before
9022                                                               any following volatile
9023                                                               global/generic
9024                                                               load/store.
9025                                                             - Ensures that
9026                                                               volatile
9027                                                               operations to
9028                                                               different
9029                                                               addresses will not
9030                                                               be reordered by
9031                                                               hardware.
9032
9033      load         *none*       *none*         - local    1. ds_load
9034      store        *none*       *none*         - global   - !volatile & !nontemporal
9035                                               - generic
9036                                               - private    1. buffer/global/flat_store
9037                                               - constant
9038                                                          - !volatile & nontemporal
9039
9040                                                            1. buffer/global/flat_store
9041                                                               nt=1
9042
9043                                                          - volatile
9044
9045                                                            1. buffer/global/flat_store
9046                                                               sc0=1 sc1=1
9047                                                            2. s_waitcnt vmcnt(0)
9048
9049                                                             - Must happen before
9050                                                               any following volatile
9051                                                               global/generic
9052                                                               load/store.
9053                                                             - Ensures that
9054                                                               volatile
9055                                                               operations to
9056                                                               different
9057                                                               addresses will not
9058                                                               be reordered by
9059                                                               hardware.
9060
9061      store        *none*       *none*         - local    1. ds_store
9062      **Unordered Atomic**
9063      ------------------------------------------------------------------------------------
9064      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
9065      store atomic unordered    *any*          *any*      *Same as non-atomic*.
9066      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
9067      **Monotonic Atomic**
9068      ------------------------------------------------------------------------------------
9069      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
9070                                - wavefront    - generic
9071      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
9072                                               - generic     sc0=1
9073      load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
9074                                - wavefront               local address space cannot
9075                                - workgroup               be used.*
9076
9077                                                          1. ds_load
9078      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
9079                                               - generic     sc1=1
9080      load atomic  monotonic    - system       - global   1. buffer/global/flat_load
9081                                               - generic     sc0=1 sc1=1
9082      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
9083                                - wavefront    - generic
9084      store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
9085                                               - generic     sc0=1
9086      store atomic monotonic    - agent        - global   1. buffer/global/flat_store
9087                                               - generic     sc1=1
9088      store atomic monotonic    - system       - global   1. buffer/global/flat_store
9089                                               - generic     sc0=1 sc1=1
9090      store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
9091                                - wavefront               local address space cannot
9092                                - workgroup               be used.*
9093
9094                                                          1. ds_store
9095      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
9096                                - wavefront    - generic
9097                                - workgroup
9098                                - agent
9099      atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
9100                                               - generic     sc1=1
9101      atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
9102                                - wavefront               local address space cannot
9103                                - workgroup               be used.*
9104
9105                                                          1. ds_atomic
9106      **Acquire Atomic**
9107      ------------------------------------------------------------------------------------
9108      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
9109                                - wavefront    - local
9110                                               - generic
9111      load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
9112                                                          2. s_waitcnt vmcnt(0)
9113
9114                                                            - If not TgSplit execution
9115                                                              mode, omit.
9116                                                            - Must happen before the
9117                                                              following buffer_inv.
9118
9119                                                          3. buffer_inv sc0=1
9120
9121                                                            - If not TgSplit execution
9122                                                              mode, omit.
9123                                                            - Must happen before
9124                                                              any following
9125                                                              global/generic
9126                                                              load/load
9127                                                              atomic/store/store
9128                                                              atomic/atomicrmw.
9129                                                            - Ensures that
9130                                                              following
9131                                                              loads will not see
9132                                                              stale data.
9133
9134      load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
9135                                                          local address space cannot
9136                                                          be used.*
9137
9138                                                          1. ds_load
9139                                                          2. s_waitcnt lgkmcnt(0)
9140
9141                                                            - If OpenCL, omit.
9142                                                            - Must happen before
9143                                                              any following
9144                                                              global/generic
9145                                                              load/load
9146                                                              atomic/store/store
9147                                                              atomic/atomicrmw.
9148                                                            - Ensures any
9149                                                              following global
9150                                                              data read is no
9151                                                              older than the local load
9152                                                              atomic value being
9153                                                              acquired.
9154
9155      load atomic  acquire      - workgroup    - generic  1. flat_load sc0=1
9156                                                          2. s_waitcnt lgkm/vmcnt(0)
9157
9158                                                            - Use lgkmcnt(0) if not
9159                                                              TgSplit execution mode
9160                                                              and vmcnt(0) if TgSplit
9161                                                              execution mode.
9162                                                            - If OpenCL, omit lgkmcnt(0).
9163                                                            - Must happen before
9164                                                              the following
9165                                                              buffer_inv and any
9166                                                              following global/generic
9167                                                              load/load
9168                                                              atomic/store/store
9169                                                              atomic/atomicrmw.
9170                                                            - Ensures any
9171                                                              following global
9172                                                              data read is no
9173                                                              older than a local load
9174                                                              atomic value being
9175                                                              acquired.
9176
9177                                                          3. buffer_inv sc0=1
9178
9179                                                            - If not TgSplit execution
9180                                                              mode, omit.
9181                                                            - Ensures that
9182                                                              following
9183                                                              loads will not see
9184                                                              stale data.
9185
9186      load atomic  acquire      - agent        - global   1. buffer/global_load
9187                                                             sc1=1
9188                                                          2. s_waitcnt vmcnt(0)
9189
9190                                                            - Must happen before
9191                                                              following
9192                                                              buffer_inv.
9193                                                            - Ensures the load
9194                                                              has completed
9195                                                              before invalidating
9196                                                              the cache.
9197
9198                                                          3. buffer_inv sc1=1
9199
9200                                                            - Must happen before
9201                                                              any following
9202                                                              global/generic
9203                                                              load/load
9204                                                              atomic/atomicrmw.
9205                                                            - Ensures that
9206                                                              following
9207                                                              loads will not see
9208                                                              stale global data.
9209
9210      load atomic  acquire      - system       - global   1. buffer/global/flat_load
9211                                                             sc0=1 sc1=1
9212                                                          2. s_waitcnt vmcnt(0)
9213
9214                                                            - Must happen before
9215                                                              following
9216                                                              buffer_inv.
9217                                                            - Ensures the load
9218                                                              has completed
9219                                                              before invalidating
9220                                                              the cache.
9221
9222                                                          3. buffer_inv sc0=1 sc1=1
9223
9224                                                            - Must happen before
9225                                                              any following
9226                                                              global/generic
9227                                                              load/load
9228                                                              atomic/atomicrmw.
9229                                                            - Ensures that
9230                                                              following
9231                                                              loads will not see
9232                                                              stale MTYPE NC global data.
9233                                                              MTYPE RW and CC memory will
9234                                                              never be stale due to the
9235                                                              memory probes.
9236
9237      load atomic  acquire      - agent        - generic  1. flat_load sc1=1
9238                                                          2. s_waitcnt vmcnt(0) &
9239                                                             lgkmcnt(0)
9240
9241                                                            - If TgSplit execution mode,
9242                                                              omit lgkmcnt(0).
9243                                                            - If OpenCL, omit
9244                                                              lgkmcnt(0).
9245                                                            - Must happen before
9246                                                              following
9247                                                              buffer_inv.
9248                                                            - Ensures the flat_load
9249                                                              has completed
9250                                                              before invalidating
9251                                                              the cache.
9252
9253                                                          3. buffer_inv sc1=1
9254
9255                                                            - Must happen before
9256                                                              any following
9257                                                              global/generic
9258                                                              load/load
9259                                                              atomic/atomicrmw.
9260                                                            - Ensures that
9261                                                              following loads
9262                                                              will not see stale
9263                                                              global data.
9264
9265      load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
9266                                                          2. s_waitcnt vmcnt(0) &
9267                                                             lgkmcnt(0)
9268
9269                                                            - If TgSplit execution mode,
9270                                                              omit lgkmcnt(0).
9271                                                            - If OpenCL, omit
9272                                                              lgkmcnt(0).
9273                                                            - Must happen before
9274                                                              the following
9275                                                              buffer_inv.
9276                                                            - Ensures the flat_load
9277                                                              has completed
9278                                                              before invalidating
9279                                                              the caches.
9280
9281                                                          3. buffer_inv sc0=1 sc1=1
9282
9283                                                            - Must happen before
9284                                                              any following
9285                                                              global/generic
9286                                                              load/load
9287                                                              atomic/atomicrmw.
9288                                                            - Ensures that
9289                                                              following
9290                                                              loads will not see
9291                                                              stale MTYPE NC global data.
9292                                                              MTYPE RW and CC memory will
9293                                                              never be stale due to the
9294                                                              memory probes.
9295
9296      atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
9297                                - wavefront    - generic
9298      atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
9299                                - wavefront               local address space cannot
9300                                                          be used.*
9301
9302                                                          1. ds_atomic
9303      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
9304                                                          2. s_waitcnt vmcnt(0)
9305
9306                                                            - If not TgSplit execution
9307                                                              mode, omit.
9308                                                            - Must happen before the
9309                                                              following buffer_inv.
9310                                                            - Ensures the atomicrmw
9311                                                              has completed
9312                                                              before invalidating
9313                                                              the cache.
9314
9315                                                          3. buffer_inv sc0=1
9316
9317                                                            - If not TgSplit execution
9318                                                              mode, omit.
9319                                                            - Must happen before
9320                                                              any following
9321                                                              global/generic
9322                                                              load/load
9323                                                              atomic/atomicrmw.
9324                                                            - Ensures that
9325                                                              following loads
9326                                                              will not see stale
9327                                                              global data.
9328
9329      atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
9330                                                          local address space cannot
9331                                                          be used.*
9332
9333                                                          1. ds_atomic
9334                                                          2. s_waitcnt lgkmcnt(0)
9335
9336                                                            - If OpenCL, omit.
9337                                                            - Must happen before
9338                                                              any following
9339                                                              global/generic
9340                                                              load/load
9341                                                              atomic/store/store
9342                                                              atomic/atomicrmw.
9343                                                            - Ensures any
9344                                                              following global
9345                                                              data read is no
9346                                                              older than the local
9347                                                              atomicrmw value
9348                                                              being acquired.
9349
9350      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
9351                                                          2. s_waitcnt lgkm/vmcnt(0)
9352
9353                                                            - Use lgkmcnt(0) if not
9354                                                              TgSplit execution mode
9355                                                              and vmcnt(0) if TgSplit
9356                                                              execution mode.
9357                                                            - If OpenCL, omit lgkmcnt(0).
9358                                                            - Must happen before
9359                                                              the following
9360                                                              buffer_inv and
9361                                                              any following
9362                                                              global/generic
9363                                                              load/load
9364                                                              atomic/store/store
9365                                                              atomic/atomicrmw.
9366                                                            - Ensures any
9367                                                              following global
9368                                                              data read is no
9369                                                              older than a local
9370                                                              atomicrmw value
9371                                                              being acquired.
9372
9373                                                          3. buffer_inv sc0=1
9374
9375                                                            - If not TgSplit execution
9376                                                              mode, omit.
9377                                                            - Ensures that
9378                                                              following
9379                                                              loads will not see
9380                                                              stale data.
9381
9382      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
9383                                                          2. s_waitcnt vmcnt(0)
9384
9385                                                            - Must happen before
9386                                                              following
9387                                                              buffer_inv.
9388                                                            - Ensures the
9389                                                              atomicrmw has
9390                                                              completed before
9391                                                              invalidating the
9392                                                              cache.
9393
9394                                                          3. buffer_inv sc1=1
9395
9396                                                            - Must happen before
9397                                                              any following
9398                                                              global/generic
9399                                                              load/load
9400                                                              atomic/atomicrmw.
9401                                                            - Ensures that
9402                                                              following loads
9403                                                              will not see stale
9404                                                              global data.
9405
9406      atomicrmw    acquire      - system       - global   1. buffer/global_atomic
9407                                                             sc1=1
9408                                                          2. s_waitcnt vmcnt(0)
9409
9410                                                            - Must happen before
9411                                                              following
9412                                                              buffer_inv.
9413                                                            - Ensures the
9414                                                              atomicrmw has
9415                                                              completed before
9416                                                              invalidating the
9417                                                              caches.
9418
9419                                                          3. buffer_inv sc0=1 sc1=1
9420
9421                                                            - Must happen before
9422                                                              any following
9423                                                              global/generic
9424                                                              load/load
9425                                                              atomic/atomicrmw.
9426                                                            - Ensures that
9427                                                              following
9428                                                              loads will not see
9429                                                              stale MTYPE NC global data.
9430                                                              MTYPE RW and CC memory will
9431                                                              never be stale due to the
9432                                                              memory probes.
9433
9434      atomicrmw    acquire      - agent        - generic  1. flat_atomic
9435                                                          2. s_waitcnt vmcnt(0) &
9436                                                             lgkmcnt(0)
9437
9438                                                            - If TgSplit execution mode,
9439                                                              omit lgkmcnt(0).
9440                                                            - If OpenCL, omit
9441                                                              lgkmcnt(0).
9442                                                            - Must happen before
9443                                                              following
9444                                                              buffer_inv.
9445                                                            - Ensures the
9446                                                              atomicrmw has
9447                                                              completed before
9448                                                              invalidating the
9449                                                              cache.
9450
9451                                                          3. buffer_inv sc1=1
9452
9453                                                            - Must happen before
9454                                                              any following
9455                                                              global/generic
9456                                                              load/load
9457                                                              atomic/atomicrmw.
9458                                                            - Ensures that
9459                                                              following loads
9460                                                              will not see stale
9461                                                              global data.
9462
9463      atomicrmw    acquire      - system       - generic  1. flat_atomic sc1=1
9464                                                          2. s_waitcnt vmcnt(0) &
9465                                                             lgkmcnt(0)
9466
9467                                                            - If TgSplit execution mode,
9468                                                              omit lgkmcnt(0).
9469                                                            - If OpenCL, omit
9470                                                              lgkmcnt(0).
9471                                                            - Must happen before
9472                                                              following
9473                                                              buffer_inv.
9474                                                            - Ensures the
9475                                                              atomicrmw has
9476                                                              completed before
9477                                                              invalidating the
9478                                                              caches.
9479
9480                                                          3. buffer_inv sc0=1 sc1=1
9481
9482                                                            - Must happen before
9483                                                              any following
9484                                                              global/generic
9485                                                              load/load
9486                                                              atomic/atomicrmw.
9487                                                            - Ensures that
9488                                                              following
9489                                                              loads will not see
9490                                                              stale MTYPE NC global data.
9491                                                              MTYPE RW and CC memory will
9492                                                              never be stale due to the
9493                                                              memory probes.
9494
9495      fence        acquire      - singlethread *none*     *none*
9496                                - wavefront
9497      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9498
9499                                                            - Use lgkmcnt(0) if not
9500                                                              TgSplit execution mode
9501                                                              and vmcnt(0) if TgSplit
9502                                                              execution mode.
9503                                                            - If OpenCL and
9504                                                              address space is
9505                                                              not generic, omit
9506                                                              lgkmcnt(0).
9507                                                            - If OpenCL and
9508                                                              address space is
9509                                                              local, omit
9510                                                              vmcnt(0).
9511                                                            - However, since LLVM
9512                                                              currently has no
9513                                                              address space on
9514                                                              the fence, the waits
9515                                                              must conservatively
9516                                                              always be generated.
9517                                                              If the fence had an
9518                                                              address space, it
9519                                                              would be set to that
9520                                                              of the OpenCL
9521                                                              fence flag, or to
9522                                                              generic if both
9523                                                              local and global
9524                                                              flags are
9525                                                              specified.
9526                                                            - s_waitcnt vmcnt(0)
9527                                                              must happen after
9528                                                              any preceding
9529                                                              global/generic load
9530                                                              atomic/
9531                                                              atomicrmw
9532                                                              with an equal or
9533                                                              wider sync scope
9534                                                              and memory ordering
9535                                                              stronger than
9536                                                              unordered (this is
9537                                                              termed the
9538                                                              fence-paired-atomic).
9539                                                            - s_waitcnt lgkmcnt(0)
9540                                                              must happen after
9541                                                              any preceding
9542                                                              local/generic load
9543                                                              atomic/atomicrmw
9544                                                              with an equal or
9545                                                              wider sync scope
9546                                                              and memory ordering
9547                                                              stronger than
9548                                                              unordered (this is
9549                                                              termed the
9550                                                              fence-paired-atomic).
9551                                                            - Must happen before
9552                                                              the following
9553                                                              buffer_inv and
9554                                                              any following
9555                                                              global/generic
9556                                                              load/load
9557                                                              atomic/store/store
9558                                                              atomic/atomicrmw.
9559                                                            - Ensures any
9560                                                              following global
9561                                                              data read is no
9562                                                              older than the
9563                                                              value read by the
9564                                                              fence-paired-atomic.
9565
9566                                                          2. buffer_inv sc0=1
9567
9568                                                            - If not TgSplit execution
9569                                                              mode, omit.
9570                                                            - Ensures that
9571                                                              following
9572                                                              loads will not see
9573                                                              stale data.
9574
9575      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9576                                                             vmcnt(0)
9577
9578                                                            - If TgSplit execution mode,
9579                                                              omit lgkmcnt(0).
9580                                                            - If OpenCL and
9581                                                              address space is
9582                                                              not generic, omit
9583                                                              lgkmcnt(0).
9584                                                            - However, since LLVM
9585                                                              currently has no
9586                                                              address space on
9587                                                              the fence, lgkmcnt(0)
9588                                                              must conservatively
9589                                                              always be generated
9590                                                              (see comment for
9591                                                              previous fence).
9592                                                            - Could be split into
9593                                                              separate s_waitcnt
9594                                                              vmcnt(0) and
9595                                                              s_waitcnt
9596                                                              lgkmcnt(0) to allow
9597                                                              them to be
9598                                                              independently moved
9599                                                              according to the
9600                                                              following rules.
9601                                                            - s_waitcnt vmcnt(0)
9602                                                              must happen after
9603                                                              any preceding
9604                                                              global/generic load
9605                                                              atomic/atomicrmw
9606                                                              with an equal or
9607                                                              wider sync scope
9608                                                              and memory ordering
9609                                                              stronger than
9610                                                              unordered (this is
9611                                                              termed the
9612                                                              fence-paired-atomic).
9613                                                            - s_waitcnt lgkmcnt(0)
9614                                                              must happen after
9615                                                              any preceding
9616                                                              local/generic load
9617                                                              atomic/atomicrmw
9618                                                              with an equal or
9619                                                              wider sync scope
9620                                                              and memory ordering
9621                                                              stronger than
9622                                                              unordered (this is
9623                                                              termed the
9624                                                              fence-paired-atomic).
9625                                                            - Must happen before
9626                                                              the following
9627                                                              buffer_inv.
9628                                                            - Ensures that the
9629                                                              fence-paired-atomic
9630                                                              has completed
9631                                                              before invalidating
9632                                                              the
9633                                                              cache. Therefore
9634                                                              any following
9635                                                              locations read must
9636                                                              be no older than
9637                                                              the value read by
9638                                                              the
9639                                                              fence-paired-atomic.
9640
9641                                                          2. buffer_inv sc1=1
9642
9643                                                            - Must happen before any
9644                                                              following global/generic
9645                                                              load/load
9646                                                              atomic/store/store
9647                                                              atomic/atomicrmw.
9648                                                            - Ensures that
9649                                                              following loads
9650                                                              will not see stale
9651                                                              global data.
9652
9653      fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
9654                                                             vmcnt(0)
9655
9656                                                            - If TgSplit execution mode,
9657                                                              omit lgkmcnt(0).
9658                                                            - If OpenCL and
9659                                                              address space is
9660                                                              not generic, omit
9661                                                              lgkmcnt(0).
9662                                                            - However, since LLVM
9663                                                              currently has no
9664                                                              address space on
9665                                                              the fence, lgkmcnt(0)
9666                                                              must conservatively
9667                                                              always be generated
9668                                                              (see comment for
9669                                                              previous fence).
9670                                                            - Could be split into
9671                                                              separate s_waitcnt
9672                                                              vmcnt(0) and
9673                                                              s_waitcnt
9674                                                              lgkmcnt(0) to allow
9675                                                              them to be
9676                                                              independently moved
9677                                                              according to the
9678                                                              following rules.
9679                                                            - s_waitcnt vmcnt(0)
9680                                                              must happen after
9681                                                              any preceding
9682                                                              global/generic load
9683                                                              atomic/atomicrmw
9684                                                              with an equal or
9685                                                              wider sync scope
9686                                                              and memory ordering
9687                                                              stronger than
9688                                                              unordered (this is
9689                                                              termed the
9690                                                              fence-paired-atomic).
9691                                                            - s_waitcnt lgkmcnt(0)
9692                                                              must happen after
9693                                                              any preceding
9694                                                              local/generic load
9695                                                              atomic/atomicrmw
9696                                                              with an equal or
9697                                                              wider sync scope
9698                                                              and memory ordering
9699                                                              stronger than
9700                                                              unordered (this is
9701                                                              termed the
9702                                                              fence-paired-atomic).
9703                                                            - Must happen before
9704                                                              the following
9705                                                              buffer_inv.
9706                                                            - Ensures that the
9707                                                              fence-paired-atomic
9708                                                              has completed
9709                                                              before invalidating
9710                                                              the
9711                                                              cache. Therefore
9712                                                              any following
9713                                                              locations read must
9714                                                              be no older than
9715                                                              the value read by
9716                                                              the
9717                                                              fence-paired-atomic.
9718
9719                                                          2. buffer_inv sc0=1 sc1=1
9720
9721                                                            - Must happen before any
9722                                                              following global/generic
9723                                                              load/load
9724                                                              atomic/store/store
9725                                                              atomic/atomicrmw.
9726                                                            - Ensures that
9727                                                              following loads
9728                                                              will not see stale
9729                                                              global data.
9730
9731      **Release Atomic**
9732      ------------------------------------------------------------------------------------
9733      store atomic release      - singlethread - global   1. buffer/global/flat_store
9734                                - wavefront    - generic
9735      store atomic release      - singlethread - local    *If TgSplit execution mode,
9736                                - wavefront               local address space cannot
9737                                                          be used.*
9738
9739                                                          1. ds_store
9740      store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9741                                               - generic
9742                                                            - Use lgkmcnt(0) if not
9743                                                              TgSplit execution mode
9744                                                              and vmcnt(0) if TgSplit
9745                                                              execution mode.
9746                                                            - If OpenCL, omit lgkmcnt(0).
9747                                                            - s_waitcnt vmcnt(0)
9748                                                              must happen after
9749                                                              any preceding
9750                                                              global/generic load/store/
9751                                                              load atomic/store atomic/
9752                                                              atomicrmw.
9753                                                            - s_waitcnt lgkmcnt(0)
9754                                                              must happen after
9755                                                              any preceding
9756                                                              local/generic
9757                                                              load/store/load
9758                                                              atomic/store
9759                                                              atomic/atomicrmw.
9760                                                            - Must happen before
9761                                                              the following
9762                                                              store.
9763                                                            - Ensures that all
9764                                                              memory operations
9765                                                              have
9766                                                              completed before
9767                                                              performing the
9768                                                              store that is being
9769                                                              released.
9770
9771                                                          2. buffer/global/flat_store sc0=1
9772      store atomic release      - workgroup    - local    *If TgSplit execution mode,
9773                                                          local address space cannot
9774                                                          be used.*
9775
9776                                                          1. ds_store
9777      store atomic release      - agent        - global   1. buffer_wbl2 sc1=1
9778                                               - generic
9779                                                            - Must happen before
9780                                                              following s_waitcnt.
9781                                                            - Performs L2 writeback to
9782                                                              ensure previous
9783                                                              global/generic
9784                                                              store/atomicrmw are
9785                                                              visible at agent scope.
9786
9787                                                          2. s_waitcnt lgkmcnt(0) &
9788                                                             vmcnt(0)
9789
9790                                                            - If TgSplit execution mode,
9791                                                              omit lgkmcnt(0).
9792                                                            - If OpenCL and
9793                                                              address space is
9794                                                              not generic, omit
9795                                                              lgkmcnt(0).
9796                                                            - Could be split into
9797                                                              separate s_waitcnt
9798                                                              vmcnt(0) and
9799                                                              s_waitcnt
9800                                                              lgkmcnt(0) to allow
9801                                                              them to be
9802                                                              independently moved
9803                                                              according to the
9804                                                              following rules.
9805                                                            - s_waitcnt vmcnt(0)
9806                                                              must happen after
9807                                                              any preceding
9808                                                              global/generic
9809                                                              load/store/load
9810                                                              atomic/store
9811                                                              atomic/atomicrmw.
9812                                                            - s_waitcnt lgkmcnt(0)
9813                                                              must happen after
9814                                                              any preceding
9815                                                              local/generic
9816                                                              load/store/load
9817                                                              atomic/store
9818                                                              atomic/atomicrmw.
9819                                                            - Must happen before
9820                                                              the following
9821                                                              store.
9822                                                            - Ensures that all
9823                                                              memory operations
9824                                                              to memory have
9825                                                              completed before
9826                                                              performing the
9827                                                              store that is being
9828                                                              released.
9829
9830                                                          3. buffer/global/flat_store sc1=1
9831      store atomic release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9832                                               - generic
9833                                                            - Must happen before
9834                                                              following s_waitcnt.
9835                                                            - Performs L2 writeback to
9836                                                              ensure previous
9837                                                              global/generic
9838                                                              store/atomicrmw are
9839                                                              visible at system scope.
9840
9841                                                          2. s_waitcnt lgkmcnt(0) &
9842                                                             vmcnt(0)
9843
9844                                                            - If TgSplit execution mode,
9845                                                              omit lgkmcnt(0).
9846                                                            - If OpenCL and
9847                                                              address space is
9848                                                              not generic, omit
9849                                                              lgkmcnt(0).
9850                                                            - Could be split into
9851                                                              separate s_waitcnt
9852                                                              vmcnt(0) and
9853                                                              s_waitcnt
9854                                                              lgkmcnt(0) to allow
9855                                                              them to be
9856                                                              independently moved
9857                                                              according to the
9858                                                              following rules.
9859                                                            - s_waitcnt vmcnt(0)
9860                                                              must happen after any
9861                                                              preceding
9862                                                              global/generic
9863                                                              load/store/load
9864                                                              atomic/store
9865                                                              atomic/atomicrmw.
9866                                                            - s_waitcnt lgkmcnt(0)
9867                                                              must happen after any
9868                                                              preceding
9869                                                              local/generic
9870                                                              load/store/load
9871                                                              atomic/store
9872                                                              atomic/atomicrmw.
9873                                                            - Must happen before
9874                                                              the following
9875                                                              store.
9876                                                            - Ensures that all
9877                                                              memory operations
9878                                                              to memory and the L2
9879                                                              writeback have
9880                                                              completed before
9881                                                              performing the
9882                                                              store that is being
9883                                                              released.
9884
9885                                                          3. buffer/global/flat_store
9886                                                             sc0=1 sc1=1
9887      atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
9888                                - wavefront    - generic
9889      atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
9890                                - wavefront               local address space cannot
9891                                                          be used.*
9892
9893                                                          1. ds_atomic
9894      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9895                                               - generic
9896                                                            - Use lgkmcnt(0) if not
9897                                                              TgSplit execution mode
9898                                                              and vmcnt(0) if TgSplit
9899                                                              execution mode.
9900                                                            - If OpenCL, omit
9901                                                              lgkmcnt(0).
9902                                                            - s_waitcnt vmcnt(0)
9903                                                              must happen after
9904                                                              any preceding
9905                                                              global/generic load/store/
9906                                                              load atomic/store atomic/
9907                                                              atomicrmw.
9908                                                            - s_waitcnt lgkmcnt(0)
9909                                                              must happen after
9910                                                              any preceding
9911                                                              local/generic
9912                                                              load/store/load
9913                                                              atomic/store
9914                                                              atomic/atomicrmw.
9915                                                            - Must happen before
9916                                                              the following
9917                                                              atomicrmw.
9918                                                            - Ensures that all
9919                                                              memory operations
9920                                                              have
9921                                                              completed before
9922                                                              performing the
9923                                                              atomicrmw that is
9924                                                              being released.
9925
9926                                                          2. buffer/global/flat_atomic sc0=1
9927      atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
9928                                                          local address space cannot
9929                                                          be used.*
9930
9931                                                          1. ds_atomic
9932      atomicrmw    release      - agent        - global   1. buffer_wbl2 sc1=1
9933                                               - generic
9934                                                            - Must happen before
9935                                                              following s_waitcnt.
9936                                                            - Performs L2 writeback to
9937                                                              ensure previous
9938                                                              global/generic
9939                                                              store/atomicrmw are
9940                                                              visible at agent scope.
9941
9942                                                          2. s_waitcnt lgkmcnt(0) &
9943                                                             vmcnt(0)
9944
9945                                                            - If TgSplit execution mode,
9946                                                              omit lgkmcnt(0).
9947                                                            - If OpenCL, omit
9948                                                              lgkmcnt(0).
9949                                                            - Could be split into
9950                                                              separate s_waitcnt
9951                                                              vmcnt(0) and
9952                                                              s_waitcnt
9953                                                              lgkmcnt(0) to allow
9954                                                              them to be
9955                                                              independently moved
9956                                                              according to the
9957                                                              following rules.
9958                                                            - s_waitcnt vmcnt(0)
9959                                                              must happen after
9960                                                              any preceding
9961                                                              global/generic
9962                                                              load/store/load
9963                                                              atomic/store
9964                                                              atomic/atomicrmw.
9965                                                            - s_waitcnt lgkmcnt(0)
9966                                                              must happen after
9967                                                              any preceding
9968                                                              local/generic
9969                                                              load/store/load
9970                                                              atomic/store
9971                                                              atomic/atomicrmw.
9972                                                            - Must happen before
9973                                                              the following
9974                                                              atomicrmw.
9975                                                            - Ensures that all
9976                                                              memory operations
9977                                                              to global and local
9978                                                              have completed
9979                                                              before performing
9980                                                              the atomicrmw that
9981                                                              is being released.
9982
9983                                                          3. buffer/global/flat_atomic sc1=1
9984      atomicrmw    release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9985                                               - generic
9986                                                            - Must happen before
9987                                                              following s_waitcnt.
9988                                                            - Performs L2 writeback to
9989                                                              ensure previous
9990                                                              global/generic
9991                                                              store/atomicrmw are
9992                                                              visible at system scope.
9993
9994                                                          2. s_waitcnt lgkmcnt(0) &
9995                                                             vmcnt(0)
9996
9997                                                            - If TgSplit execution mode,
9998                                                              omit lgkmcnt(0).
9999                                                            - If OpenCL, omit
10000                                                              lgkmcnt(0).
10001                                                            - Could be split into
10002                                                              separate s_waitcnt
10003                                                              vmcnt(0) and
10004                                                              s_waitcnt
10005                                                              lgkmcnt(0) to allow
10006                                                              them to be
10007                                                              independently moved
10008                                                              according to the
10009                                                              following rules.
10010                                                            - s_waitcnt vmcnt(0)
10011                                                              must happen after
10012                                                              any preceding
10013                                                              global/generic
10014                                                              load/store/load
10015                                                              atomic/store
10016                                                              atomic/atomicrmw.
10017                                                            - s_waitcnt lgkmcnt(0)
10018                                                              must happen after
10019                                                              any preceding
10020                                                              local/generic
10021                                                              load/store/load
10022                                                              atomic/store
10023                                                              atomic/atomicrmw.
10024                                                            - Must happen before
10025                                                              the following
10026                                                              atomicrmw.
10027                                                            - Ensures that all
10028                                                              memory operations
10029                                                              to memory and the L2
10030                                                              writeback have
10031                                                              completed before
10032                                                              performing the
10033                                                              atomicrmw that is
10034                                                              being released.
10035
10036                                                          3. buffer/global/flat_atomic
10037                                                             sc0=1 sc1=1
10038      fence        release      - singlethread *none*     *none*
10039                                - wavefront
10040      fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
10041
10042                                                            - Use lgkmcnt(0) if not
10043                                                              TgSplit execution mode
10044                                                              and vmcnt(0) if TgSplit
10045                                                              execution mode.
10046                                                            - If OpenCL and
10047                                                              address space is
10048                                                              not generic, omit
10049                                                              lgkmcnt(0).
10050                                                            - If OpenCL and
10051                                                              address space is
10052                                                              local, omit
10053                                                              vmcnt(0).
10054                                                            - However, since LLVM
10055                                                              currently has no
10056                                                              address space on
10057                                                              the fence, the wait
10058                                                              must conservatively
10059                                                              always be generated.
10060                                                              If the fence had an
10061                                                              address space, it
10062                                                              would be set to the
10063                                                              address space of the
10064                                                              OpenCL fence flag, or
10065                                                              to generic if both
10066                                                              local and global
10067                                                              flags are
10068                                                              specified.
10069                                                            - s_waitcnt vmcnt(0)
10070                                                              must happen after
10071                                                              any preceding
10072                                                              global/generic
10073                                                              load/store/
10074                                                              load atomic/store atomic/
10075                                                              atomicrmw.
10076                                                            - s_waitcnt lgkmcnt(0)
10077                                                              must happen after
10078                                                              any preceding
10079                                                              local/generic
10080                                                              load/load
10081                                                              atomic/store/store
10082                                                              atomic/atomicrmw.
10083                                                            - Must happen before
10084                                                              any following store
10085                                                              atomic/atomicrmw
10086                                                              with an equal or
10087                                                              wider sync scope
10088                                                              and memory ordering
10089                                                              stronger than
10090                                                              unordered (this is
10091                                                              termed the
10092                                                              fence-paired-atomic).
10093                                                            - Ensures that all
10094                                                              memory operations
10095                                                              have
10096                                                              completed before
10097                                                              performing the
10098                                                              following
10099                                                              fence-paired-atomic.
10100
10101      fence        release      - agent        *none*     1. buffer_wbl2 sc1=1
10102
10103                                                            - If OpenCL and
10104                                                              address space is
10105                                                              local, omit.
10106                                                            - Must happen before
10107                                                              following s_waitcnt.
10108                                                            - Performs L2 writeback to
10109                                                              ensure previous
10110                                                              global/generic
10111                                                              store/atomicrmw are
10112                                                              visible at agent scope.
10113
10114                                                          2. s_waitcnt lgkmcnt(0) &
10115                                                             vmcnt(0)
10116
10117                                                            - If TgSplit execution mode,
10118                                                              omit lgkmcnt(0).
10119                                                            - If OpenCL and
10120                                                              address space is
10121                                                              not generic, omit
10122                                                              lgkmcnt(0).
10123                                                            - If OpenCL and
10124                                                              address space is
10125                                                              local, omit
10126                                                              vmcnt(0).
10127                                                            - However, since LLVM
10128                                                              currently has no
10129                                                              address space on
10130                                                              the fence, the wait
10131                                                              must conservatively
10132                                                              always be generated.
10133                                                              If the fence had an
10134                                                              address space, it
10135                                                              would be set to the
10136                                                              address space of the
10137                                                              OpenCL fence flag, or
10138                                                              to generic if both
10139                                                              local and global
10140                                                              flags are
10141                                                              specified.
10142                                                            - Could be split into
10143                                                              separate s_waitcnt
10144                                                              vmcnt(0) and
10145                                                              s_waitcnt
10146                                                              lgkmcnt(0) to allow
10147                                                              them to be
10148                                                              independently moved
10149                                                              according to the
10150                                                              following rules.
10151                                                            - s_waitcnt vmcnt(0)
10152                                                              must happen after
10153                                                              any preceding
10154                                                              global/generic
10155                                                              load/store/load
10156                                                              atomic/store
10157                                                              atomic/atomicrmw.
10158                                                            - s_waitcnt lgkmcnt(0)
10159                                                              must happen after
10160                                                              any preceding
10161                                                              local/generic
10162                                                              load/store/load
10163                                                              atomic/store
10164                                                              atomic/atomicrmw.
10165                                                            - Must happen before
10166                                                              any following store
10167                                                              atomic/atomicrmw
10168                                                              with an equal or
10169                                                              wider sync scope
10170                                                              and memory ordering
10171                                                              stronger than
10172                                                              unordered (this is
10173                                                              termed the
10174                                                              fence-paired-atomic).
10175                                                            - Ensures that all
10176                                                              memory operations
10177                                                              have
10178                                                              completed before
10179                                                              performing the
10180                                                              following
10181                                                              fence-paired-atomic.
10182
10183      fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10184
10185                                                            - Must happen before
10186                                                              following s_waitcnt.
10187                                                            - Performs L2 writeback to
10188                                                              ensure previous
10189                                                              global/generic
10190                                                              store/atomicrmw are
10191                                                              visible at system scope.
10192
10193                                                          2. s_waitcnt lgkmcnt(0) &
10194                                                             vmcnt(0)
10195
10196                                                            - If TgSplit execution mode,
10197                                                              omit lgkmcnt(0).
10198                                                            - If OpenCL and
10199                                                              address space is
10200                                                              not generic, omit
10201                                                              lgkmcnt(0).
10202                                                            - If OpenCL and
10203                                                              address space is
10204                                                              local, omit
10205                                                              vmcnt(0).
10206                                                            - However, since LLVM
10207                                                              currently has no
10208                                                              address space on
10209                                                              the fence, the wait
10210                                                              must conservatively
10211                                                              always be generated.
10212                                                              If the fence had an
10213                                                              address space, it
10214                                                              would be set to the
10215                                                              address space of the
10216                                                              OpenCL fence flag, or
10217                                                              to generic if both
10218                                                              local and global
10219                                                              flags are
10220                                                              specified.
10221                                                            - Could be split into
10222                                                              separate s_waitcnt
10223                                                              vmcnt(0) and
10224                                                              s_waitcnt
10225                                                              lgkmcnt(0) to allow
10226                                                              them to be
10227                                                              independently moved
10228                                                              according to the
10229                                                              following rules.
10230                                                            - s_waitcnt vmcnt(0)
10231                                                              must happen after
10232                                                              any preceding
10233                                                              global/generic
10234                                                              load/store/load
10235                                                              atomic/store
10236                                                              atomic/atomicrmw.
10237                                                            - s_waitcnt lgkmcnt(0)
10238                                                              must happen after
10239                                                              any preceding
10240                                                              local/generic
10241                                                              load/store/load
10242                                                              atomic/store
10243                                                              atomic/atomicrmw.
10244                                                            - Must happen before
10245                                                              any following store
10246                                                              atomic/atomicrmw
10247                                                              with an equal or
10248                                                              wider sync scope
10249                                                              and memory ordering
10250                                                              stronger than
10251                                                              unordered (this is
10252                                                              termed the
10253                                                              fence-paired-atomic).
10254                                                            - Ensures that all
10255                                                              memory operations
10256                                                              have
10257                                                              completed before
10258                                                              performing the
10259                                                              following
10260                                                              fence-paired-atomic.
10261
10262      **Acquire-Release Atomic**
10263      ------------------------------------------------------------------------------------
10264      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
10265                                - wavefront    - generic
10266      atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
10267                                - wavefront               local address space cannot
10268                                                          be used.*
10269
10270                                                          1. ds_atomic
10271      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10272
10273                                                            - Use lgkmcnt(0) if not
10274                                                              TgSplit execution mode
10275                                                              and vmcnt(0) if TgSplit
10276                                                              execution mode.
10277                                                            - If OpenCL, omit
10278                                                              lgkmcnt(0).
10279                                                            - Must happen after
10280                                                              any preceding
10281                                                              local/generic
10282                                                              load/store/load
10283                                                              atomic/store
10284                                                              atomic/atomicrmw.
10285                                                            - s_waitcnt vmcnt(0)
10286                                                              must happen after
10287                                                              any preceding
10288                                                              global/generic load/store/
10289                                                              load atomic/store atomic/
10290                                                              atomicrmw.
10291                                                            - s_waitcnt lgkmcnt(0)
10292                                                              must happen after
10293                                                              any preceding
10294                                                              local/generic
10295                                                              load/store/load
10296                                                              atomic/store
10297                                                              atomic/atomicrmw.
10298                                                            - Must happen before
10299                                                              the following
10300                                                              atomicrmw.
10301                                                            - Ensures that all
10302                                                              memory operations
10303                                                              have
10304                                                              completed before
10305                                                              performing the
10306                                                              atomicrmw that is
10307                                                              being released.
10308
10309                                                          2. buffer/global_atomic
10310                                                          3. s_waitcnt vmcnt(0)
10311
10312                                                            - If not TgSplit execution
10313                                                              mode, omit.
10314                                                            - Must happen before
10315                                                              the following
10316                                                              buffer_inv.
10317                                                            - Ensures any
10318                                                              following global
10319                                                              data read is no
10320                                                              older than the
10321                                                              atomicrmw value
10322                                                              being acquired.
10323
10324                                                          4. buffer_inv sc0=1
10325
10326                                                            - If not TgSplit execution
10327                                                              mode, omit.
10328                                                            - Ensures that
10329                                                              following
10330                                                              loads will not see
10331                                                              stale data.
10332
10333      atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
10334                                                          local address space cannot
10335                                                          be used.*
10336
10337                                                          1. ds_atomic
10338                                                          2. s_waitcnt lgkmcnt(0)
10339
10340                                                            - If OpenCL, omit.
10341                                                            - Must happen before
10342                                                              any following
10343                                                              global/generic
10344                                                              load/load
10345                                                              atomic/store/store
10346                                                              atomic/atomicrmw.
10347                                                            - Ensures any
10348                                                              following global
10349                                                              data read is no
10350                                                              older than the local
10351                                                              atomicrmw value being
10352                                                              acquired.
10353
10354      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
10355
10356                                                            - Use lgkmcnt(0) if not
10357                                                              TgSplit execution mode
10358                                                              and vmcnt(0) if TgSplit
10359                                                              execution mode.
10360                                                            - If OpenCL, omit
10361                                                              lgkmcnt(0).
10362                                                            - s_waitcnt vmcnt(0)
10363                                                              must happen after
10364                                                              any preceding
10365                                                              global/generic load/store/
10366                                                              load atomic/store atomic/
10367                                                              atomicrmw.
10368                                                            - s_waitcnt lgkmcnt(0)
10369                                                              must happen after
10370                                                              any preceding
10371                                                              local/generic
10372                                                              load/store/load
10373                                                              atomic/store
10374                                                              atomic/atomicrmw.
10375                                                            - Must happen before
10376                                                              the following
10377                                                              atomicrmw.
10378                                                            - Ensures that all
10379                                                              memory operations
10380                                                              have
10381                                                              completed before
10382                                                              performing the
10383                                                              atomicrmw that is
10384                                                              being released.
10385
10386                                                          2. flat_atomic
10387                                                          3. s_waitcnt lgkmcnt(0) &
10388                                                             vmcnt(0)
10389
10390                                                            - If not TgSplit execution
10391                                                              mode, omit vmcnt(0).
10392                                                            - If OpenCL, omit
10393                                                              lgkmcnt(0).
10394                                                            - Must happen before
10395                                                              the following
10396                                                              buffer_inv and
10397                                                              any following
10398                                                              global/generic
10399                                                              load/load
10400                                                              atomic/store/store
10401                                                              atomic/atomicrmw.
10402                                                            - Ensures any
10403                                                              following global
10404                                                              data read is no
10405                                                              older than a local
10406                                                              atomicrmw value being
10407                                                              acquired.
10408
10409                                                          4. buffer_inv sc0=1
10410
10411                                                            - If not TgSplit execution
10412                                                              mode, omit.
10413                                                            - Ensures that
10414                                                              following
10415                                                              loads will not see
10416                                                              stale data.
10417
10418      atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1
10419
10420                                                            - Must happen before
10421                                                              following s_waitcnt.
10422                                                            - Performs L2 writeback to
10423                                                              ensure previous
10424                                                              global/generic
10425                                                              store/atomicrmw are
10426                                                              visible at agent scope.
10427
10428                                                          2. s_waitcnt lgkmcnt(0) &
10429                                                             vmcnt(0)
10430
10431                                                            - If TgSplit execution mode,
10432                                                              omit lgkmcnt(0).
10433                                                            - If OpenCL, omit
10434                                                              lgkmcnt(0).
10435                                                            - Could be split into
10436                                                              separate s_waitcnt
10437                                                              vmcnt(0) and
10438                                                              s_waitcnt
10439                                                              lgkmcnt(0) to allow
10440                                                              them to be
10441                                                              independently moved
10442                                                              according to the
10443                                                              following rules.
10444                                                            - s_waitcnt vmcnt(0)
10445                                                              must happen after
10446                                                              any preceding
10447                                                              global/generic
10448                                                              load/store/load
10449                                                              atomic/store
10450                                                              atomic/atomicrmw.
10451                                                            - s_waitcnt lgkmcnt(0)
10452                                                              must happen after
10453                                                              any preceding
10454                                                              local/generic
10455                                                              load/store/load
10456                                                              atomic/store
10457                                                              atomic/atomicrmw.
10458                                                            - Must happen before
10459                                                              the following
10460                                                              atomicrmw.
10461                                                            - Ensures that all
10462                                                              memory operations
10463                                                              to global have
10464                                                              completed before
10465                                                              performing the
10466                                                              atomicrmw that is
10467                                                              being released.
10468
10469                                                          3. buffer/global_atomic
10470                                                          4. s_waitcnt vmcnt(0)
10471
10472                                                            - Must happen before
10473                                                              following
10474                                                              buffer_inv.
10475                                                            - Ensures the
10476                                                              atomicrmw has
10477                                                              completed before
10478                                                              invalidating the
10479                                                              cache.
10480
10481                                                          5. buffer_inv sc1=1
10482
10483                                                            - Must happen before
10484                                                              any following
10485                                                              global/generic
10486                                                              load/load
10487                                                              atomic/atomicrmw.
10488                                                            - Ensures that
10489                                                              following loads
10490                                                              will not see stale
10491                                                              global data.
10492
10493      atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
10494
10495                                                            - Must happen before
10496                                                              following s_waitcnt.
10497                                                            - Performs L2 writeback to
10498                                                              ensure previous
10499                                                              global/generic
10500                                                              store/atomicrmw are
10501                                                              visible at system scope.
10502
10503                                                          2. s_waitcnt lgkmcnt(0) &
10504                                                             vmcnt(0)
10505
10506                                                            - If TgSplit execution mode,
10507                                                              omit lgkmcnt(0).
10508                                                            - If OpenCL, omit
10509                                                              lgkmcnt(0).
10510                                                            - Could be split into
10511                                                              separate s_waitcnt
10512                                                              vmcnt(0) and
10513                                                              s_waitcnt
10514                                                              lgkmcnt(0) to allow
10515                                                              them to be
10516                                                              independently moved
10517                                                              according to the
10518                                                              following rules.
10519                                                            - s_waitcnt vmcnt(0)
10520                                                              must happen after
10521                                                              any preceding
10522                                                              global/generic
10523                                                              load/store/load
10524                                                              atomic/store
10525                                                              atomic/atomicrmw.
10526                                                            - s_waitcnt lgkmcnt(0)
10527                                                              must happen after
10528                                                              any preceding
10529                                                              local/generic
10530                                                              load/store/load
10531                                                              atomic/store
10532                                                              atomic/atomicrmw.
10533                                                            - Must happen before
10534                                                              the following
10535                                                              atomicrmw.
10536                                                            - Ensures that all
10537                                                              memory operations
10538                                                              to global and L2 writeback
10539                                                              have completed before
10540                                                              performing the
10541                                                              atomicrmw that is
10542                                                              being released.
10543
10544                                                          3. buffer/global_atomic
10545                                                             sc1=1
10546                                                          4. s_waitcnt vmcnt(0)
10547
10548                                                            - Must happen before
10549                                                              following
10550                                                              buffer_inv.
10551                                                            - Ensures the
10552                                                              atomicrmw has
10553                                                              completed before
10554                                                              invalidating the
10555                                                              caches.
10556
10557                                                          5. buffer_inv sc0=1 sc1=1
10558
10559                                                            - Must happen before
10560                                                              any following
10561                                                              global/generic
10562                                                              load/load
10563                                                              atomic/atomicrmw.
10564                                                            - Ensures that
10565                                                              following loads
10566                                                              will not see stale
10567                                                              MTYPE NC global data.
10568                                                              MTYPE RW and CC memory will
10569                                                              never be stale due to the
10570                                                              memory probes.
10571
10572      atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1
10573
10574                                                            - Must happen before
10575                                                              following s_waitcnt.
10576                                                            - Performs L2 writeback to
10577                                                              ensure previous
10578                                                              global/generic
10579                                                              store/atomicrmw are
10580                                                              visible at agent scope.
10581
10582                                                          2. s_waitcnt lgkmcnt(0) &
10583                                                             vmcnt(0)
10584
10585                                                            - If TgSplit execution mode,
10586                                                              omit lgkmcnt(0).
10587                                                            - If OpenCL, omit
10588                                                              lgkmcnt(0).
10589                                                            - Could be split into
10590                                                              separate s_waitcnt
10591                                                              vmcnt(0) and
10592                                                              s_waitcnt
10593                                                              lgkmcnt(0) to allow
10594                                                              them to be
10595                                                              independently moved
10596                                                              according to the
10597                                                              following rules.
10598                                                            - s_waitcnt vmcnt(0)
10599                                                              must happen after
10600                                                              any preceding
10601                                                              global/generic
10602                                                              load/store/load
10603                                                              atomic/store
10604                                                              atomic/atomicrmw.
10605                                                            - s_waitcnt lgkmcnt(0)
10606                                                              must happen after
10607                                                              any preceding
10608                                                              local/generic
10609                                                              load/store/load
10610                                                              atomic/store
10611                                                              atomic/atomicrmw.
10612                                                            - Must happen before
10613                                                              the following
10614                                                              atomicrmw.
10615                                                            - Ensures that all
10616                                                              memory operations
10617                                                              to global have
10618                                                              completed before
10619                                                              performing the
10620                                                              atomicrmw that is
10621                                                              being released.
10622
10623                                                          3. flat_atomic
10624                                                          4. s_waitcnt vmcnt(0) &
10625                                                             lgkmcnt(0)
10626
10627                                                            - If TgSplit execution mode,
10628                                                              omit lgkmcnt(0).
10629                                                            - If OpenCL, omit
10630                                                              lgkmcnt(0).
10631                                                            - Must happen before
10632                                                              following
10633                                                              buffer_inv.
10634                                                            - Ensures the
10635                                                              atomicrmw has
10636                                                              completed before
10637                                                              invalidating the
10638                                                              cache.
10639
10640                                                          5. buffer_inv sc1=1
10641
10642                                                            - Must happen before
10643                                                              any following
10644                                                              global/generic
10645                                                              load/load
10646                                                              atomic/atomicrmw.
10647                                                            - Ensures that
10648                                                              following loads
10649                                                              will not see stale
10650                                                              global data.
10651
10652      atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1
10653
10654                                                            - Must happen before
10655                                                              following s_waitcnt.
10656                                                            - Performs L2 writeback to
10657                                                              ensure previous
10658                                                              global/generic
10659                                                              store/atomicrmw are
10660                                                              visible at system scope.
10661
10662                                                          2. s_waitcnt lgkmcnt(0) &
10663                                                             vmcnt(0)
10664
10665                                                            - If TgSplit execution mode,
10666                                                              omit lgkmcnt(0).
10667                                                            - If OpenCL, omit
10668                                                              lgkmcnt(0).
10669                                                            - Could be split into
10670                                                              separate s_waitcnt
10671                                                              vmcnt(0) and
10672                                                              s_waitcnt
10673                                                              lgkmcnt(0) to allow
10674                                                              them to be
10675                                                              independently moved
10676                                                              according to the
10677                                                              following rules.
10678                                                            - s_waitcnt vmcnt(0)
10679                                                              must happen after
10680                                                              any preceding
10681                                                              global/generic
10682                                                              load/store/load
10683                                                              atomic/store
10684                                                              atomic/atomicrmw.
10685                                                            - s_waitcnt lgkmcnt(0)
10686                                                              must happen after
10687                                                              any preceding
10688                                                              local/generic
10689                                                              load/store/load
10690                                                              atomic/store
10691                                                              atomic/atomicrmw.
10692                                                            - Must happen before
10693                                                              the following
10694                                                              atomicrmw.
10695                                                            - Ensures that all
10696                                                              memory operations
10697                                                              to global and L2 writeback
10698                                                              have completed before
10699                                                              performing the
10700                                                              atomicrmw that is
10701                                                              being released.
10702
10703                                                          3. flat_atomic sc1=1
10704                                                          4. s_waitcnt vmcnt(0) &
10705                                                             lgkmcnt(0)
10706
10707                                                            - If TgSplit execution mode,
10708                                                              omit lgkmcnt(0).
10709                                                            - If OpenCL, omit
10710                                                              lgkmcnt(0).
10711                                                            - Must happen before
10712                                                              following
10713                                                              buffer_inv.
10714                                                            - Ensures the
10715                                                              atomicrmw has
10716                                                              completed before
10717                                                              invalidating the
10718                                                              caches.
10719
10720                                                          5. buffer_inv sc0=1 sc1=1
10721
10722                                                            - Must happen before
10723                                                              any following
10724                                                              global/generic
10725                                                              load/load
10726                                                              atomic/atomicrmw.
10727                                                            - Ensures that
10728                                                              following loads
10729                                                              will not see stale
10730                                                              MTYPE NC global data.
10731                                                              MTYPE RW and CC memory will
10732                                                              never be stale due to the
10733                                                              memory probes.
10734
10735      fence        acq_rel      - singlethread *none*     *none*
10736                                - wavefront
10737      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
10738
10739                                                            - Use lgkmcnt(0) if not
10740                                                              TgSplit execution mode
10741                                                              and vmcnt(0) if TgSplit
10742                                                              execution mode.
10743                                                            - If OpenCL and
10744                                                              address space is
10745                                                              not generic, omit
10746                                                              lgkmcnt(0).
10747                                                            - If OpenCL and
10748                                                              address space is
10749                                                              local, omit
10750                                                              vmcnt(0).
10751                                                            - However,
10752                                                              since LLVM
10753                                                              currently has no
10754                                                              address space on
10755                                                              the fence, it must
10756                                                              conservatively
10757                                                              always be generated
10758                                                              (see comment for
10759                                                              previous fence).
10760                                                            - s_waitcnt vmcnt(0)
10761                                                              must happen after
10762                                                              any preceding
10763                                                              global/generic
10764                                                              load/store/
10765                                                              load atomic/store atomic/
10766                                                              atomicrmw.
10767                                                            - s_waitcnt lgkmcnt(0)
10768                                                              must happen after
10769                                                              any preceding
10770                                                              local/generic
10771                                                              load/load
10772                                                              atomic/store/store
10773                                                              atomic/atomicrmw.
10774                                                            - Must happen before
10775                                                              any following
10776                                                              global/generic
10777                                                              load/load
10778                                                              atomic/store/store
10779                                                              atomic/atomicrmw.
10780                                                            - Ensures that all
10781                                                              memory operations
10782                                                              have
10783                                                              completed before
10784                                                              performing any
10785                                                              following global
10786                                                              memory operations.
10787                                                            - Ensures that the
10788                                                              preceding
10789                                                              local/generic load
10790                                                              atomic/atomicrmw
10791                                                              with an equal or
10792                                                              wider sync scope
10793                                                              and memory ordering
10794                                                              stronger than
10795                                                              unordered (this is
10796                                                              termed the
10797                                                              acquire-fence-paired-atomic)
10798                                                              has completed
10799                                                              before following
10800                                                              global memory
10801                                                              operations. This
10802                                                              satisfies the
10803                                                              requirements of
10804                                                              acquire.
10805                                                            - Ensures that all
10806                                                              previous memory
10807                                                              operations have
10808                                                              completed before a
10809                                                              following
10810                                                              local/generic store
10811                                                              atomic/atomicrmw
10812                                                              with an equal or
10813                                                              wider sync scope
10814                                                              and memory ordering
10815                                                              stronger than
10816                                                              unordered (this is
10817                                                              termed the
10818                                                              release-fence-paired-atomic).
10819                                                              This satisfies the
10820                                                              requirements of
10821                                                              release.
10822                                                            - Must happen before
10823                                                              the following
10824                                                              buffer_inv.
10825                                                            - Ensures that the
10826                                                              acquire-fence-paired
10827                                                              atomic has completed
10828                                                              before invalidating
10829                                                              the
10830                                                              cache. Therefore
10831                                                              any following
10832                                                              locations read must
10833                                                              be no older than
10834                                                              the value read by
10835                                                              the
10836                                                              acquire-fence-paired-atomic.
10837
10838                                                          2. buffer_inv sc0=1
10839
10840                                                            - If not TgSplit execution
10841                                                              mode, omit.
10842                                                            - Ensures that
10843                                                              following
10844                                                              loads will not see
10845                                                              stale data.
10846
10847      fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1
10848
10849                                                            - If OpenCL and
10850                                                              address space is
10851                                                              local, omit.
10852                                                            - Must happen before
10853                                                              following s_waitcnt.
10854                                                            - Performs L2 writeback to
10855                                                              ensure previous
10856                                                              global/generic
10857                                                              store/atomicrmw are
10858                                                              visible at agent scope.
10859
10860                                                          2. s_waitcnt lgkmcnt(0) &
10861                                                             vmcnt(0)
10862
10863                                                            - If TgSplit execution mode,
10864                                                              omit lgkmcnt(0).
10865                                                            - If OpenCL and
10866                                                              address space is
10867                                                              not generic, omit
10868                                                              lgkmcnt(0).
10869                                                            - However, since LLVM
10870                                                              currently has no
10871                                                              address space on
10872                                                              the fence, it must
10873                                                              conservatively
10874                                                              always be generated
10875                                                              (see comment for
10876                                                              previous fence).
10877                                                            - Could be split into
10878                                                              separate s_waitcnt
10879                                                              vmcnt(0) and
10880                                                              s_waitcnt
10881                                                              lgkmcnt(0) to allow
10882                                                              them to be
10883                                                              independently moved
10884                                                              according to the
10885                                                              following rules.
10886                                                            - s_waitcnt vmcnt(0)
10887                                                              must happen after
10888                                                              any preceding
10889                                                              global/generic
10890                                                              load/store/load
10891                                                              atomic/store
10892                                                              atomic/atomicrmw.
10893                                                            - s_waitcnt lgkmcnt(0)
10894                                                              must happen after
10895                                                              any preceding
10896                                                              local/generic
10897                                                              load/store/load
10898                                                              atomic/store
10899                                                              atomic/atomicrmw.
10900                                                            - Must happen before
10901                                                              the following
10902                                                              buffer_inv.
10903                                                            - Ensures that the
10904                                                              preceding
10905                                                              global/local/generic
10906                                                              load
10907                                                              atomic/atomicrmw
10908                                                              with an equal or
10909                                                              wider sync scope
10910                                                              and memory ordering
10911                                                              stronger than
10912                                                              unordered (this is
10913                                                              termed the
10914                                                              acquire-fence-paired-atomic)
10915                                                              has completed
10916                                                              before invalidating
10917                                                              the cache. This
10918                                                              satisfies the
10919                                                              requirements of
10920                                                              acquire.
10921                                                            - Ensures that all
10922                                                              previous memory
10923                                                              operations have
10924                                                              completed before a
10925                                                              following
10926                                                              global/local/generic
10927                                                              store
10928                                                              atomic/atomicrmw
10929                                                              with an equal or
10930                                                              wider sync scope
10931                                                              and memory ordering
10932                                                              stronger than
10933                                                              unordered (this is
10934                                                              termed the
10935                                                              release-fence-paired-atomic).
10936                                                              This satisfies the
10937                                                              requirements of
10938                                                              release.
10939
10940                                                          3. buffer_inv sc1=1
10941
10942                                                            - Must happen before
10943                                                              any following
10944                                                              global/generic
10945                                                              load/load
10946                                                              atomic/store/store
10947                                                              atomic/atomicrmw.
10948                                                            - Ensures that
10949                                                              following loads
10950                                                              will not see stale
10951                                                              global data. This
10952                                                              satisfies the
10953                                                              requirements of
10954                                                              acquire.
10955
10956      fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10957
10958                                                            - If OpenCL and
10959                                                              address space is
10960                                                              local, omit.
10961                                                            - Must happen before
10962                                                              following s_waitcnt.
10963                                                            - Performs L2 writeback to
10964                                                              ensure previous
10965                                                              global/generic
10966                                                              store/atomicrmw are
10967                                                              visible at system scope.
10968
10969                                                          1. s_waitcnt lgkmcnt(0) &
10970                                                             vmcnt(0)
10971
10972                                                            - If TgSplit execution mode,
10973                                                              omit lgkmcnt(0).
10974                                                            - If OpenCL and
10975                                                              address space is
10976                                                              not generic, omit
10977                                                              lgkmcnt(0).
10978                                                            - However, since LLVM
10979                                                              currently has no
10980                                                              address space on
10981                                                              the fence, it must
10982                                                              conservatively
10983                                                              always be generated
10984                                                              (see comment for
10985                                                              previous fence).
10986                                                            - Could be split into
10987                                                              separate s_waitcnt
10988                                                              vmcnt(0) and
10989                                                              s_waitcnt
10990                                                              lgkmcnt(0) to allow
10991                                                              them to be
10992                                                              independently moved
10993                                                              according to the
10994                                                              following rules.
10995                                                            - s_waitcnt vmcnt(0)
10996                                                              must happen after
10997                                                              any preceding
10998                                                              global/generic
10999                                                              load/store/load
11000                                                              atomic/store
11001                                                              atomic/atomicrmw.
11002                                                            - s_waitcnt lgkmcnt(0)
11003                                                              must happen after
11004                                                              any preceding
11005                                                              local/generic
11006                                                              load/store/load
11007                                                              atomic/store
11008                                                              atomic/atomicrmw.
11009                                                            - Must happen before
11010                                                              the following
11011                                                              buffer_inv.
11012                                                            - Ensures that the
11013                                                              preceding
11014                                                              global/local/generic
11015                                                              load
11016                                                              atomic/atomicrmw
11017                                                              with an equal or
11018                                                              wider sync scope
11019                                                              and memory ordering
11020                                                              stronger than
11021                                                              unordered (this is
11022                                                              termed the
11023                                                              acquire-fence-paired-atomic)
11024                                                              has completed
11025                                                              before invalidating
11026                                                              the cache. This
11027                                                              satisfies the
11028                                                              requirements of
11029                                                              acquire.
11030                                                            - Ensures that all
11031                                                              previous memory
11032                                                              operations have
11033                                                              completed before a
11034                                                              following
11035                                                              global/local/generic
11036                                                              store
11037                                                              atomic/atomicrmw
11038                                                              with an equal or
11039                                                              wider sync scope
11040                                                              and memory ordering
11041                                                              stronger than
11042                                                              unordered (this is
11043                                                              termed the
11044                                                              release-fence-paired-atomic).
11045                                                              This satisfies the
11046                                                              requirements of
11047                                                              release.
11048
11049                                                          2. buffer_inv sc0=1 sc1=1
11050
11051                                                            - Must happen before
11052                                                              any following
11053                                                              global/generic
11054                                                              load/load
11055                                                              atomic/store/store
11056                                                              atomic/atomicrmw.
11057                                                            - Ensures that
11058                                                              following loads
11059                                                              will not see stale
11060                                                              MTYPE NC global data.
11061                                                              MTYPE RW and CC memory will
11062                                                              never be stale due to the
11063                                                              memory probes.
11064
11065      **Sequentially Consistent Atomic**
11066      ------------------------------------------------------------------------------------
11067      load atomic  seq_cst      - singlethread - global   *Same as corresponding
11068                                - wavefront    - local    load atomic acquire,
11069                                               - generic  except must generate
11070                                                          all instructions even
11071                                                          for OpenCL.*
11072      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
11073                                               - generic
11074                                                            - Use lgkmcnt(0) if not
11075                                                              TgSplit execution mode
11076                                                              and vmcnt(0) if TgSplit
11077                                                              execution mode.
11078                                                            - s_waitcnt lgkmcnt(0) must
11079                                                              happen after
11080                                                              preceding
11081                                                              local/generic load
11082                                                              atomic/store
11083                                                              atomic/atomicrmw
11084                                                              with memory
11085                                                              ordering of seq_cst
11086                                                              and with equal or
11087                                                              wider sync scope.
11088                                                              (Note that seq_cst
11089                                                              fences have their
11090                                                              own s_waitcnt
11091                                                              lgkmcnt(0) and so do
11092                                                              not need to be
11093                                                              considered.)
11094                                                            - s_waitcnt vmcnt(0)
11095                                                              must happen after
11096                                                              preceding
11097                                                              global/generic load
11098                                                              atomic/store
11099                                                              atomic/atomicrmw
11100                                                              with memory
11101                                                              ordering of seq_cst
11102                                                              and with equal or
11103                                                              wider sync scope.
11104                                                              (Note that seq_cst
11105                                                              fences have their
11106                                                              own s_waitcnt
11107                                                              vmcnt(0) and so do
11108                                                              not need to be
11109                                                              considered.)
11110                                                            - Ensures any
11111                                                              preceding
11112                                                              sequentially
11113                                                              consistent global/local
11114                                                              memory instructions
11115                                                              have completed
11116                                                              before executing
11117                                                              this sequentially
11118                                                              consistent
11119                                                              instruction. This
11120                                                              prevents reordering
11121                                                              a seq_cst store
11122                                                              followed by a
11123                                                              seq_cst load. (Note
11124                                                              that seq_cst is
11125                                                              stronger than
11126                                                              acquire/release as
11127                                                              the reordering of
11128                                                              load acquire
11129                                                              followed by a store
11130                                                              release is
11131                                                              prevented by the
11132                                                              s_waitcnt of
11133                                                              the release, but
11134                                                              there is nothing
11135                                                              preventing a store
11136                                                              release followed by
11137                                                              load acquire from
11138                                                              completing out of
11139                                                              order. The s_waitcnt
11140                                                              could be placed after
11141                                                              seq_store or before
11142                                                              the seq_load. We
11143                                                              choose the load to
11144                                                              make the s_waitcnt be
11145                                                              as late as possible
11146                                                              so that the store
11147                                                              may have already
11148                                                              completed.)
11149
11150                                                          2. *Following
11151                                                             instructions same as
11152                                                             corresponding load
11153                                                             atomic acquire,
11154                                                             except must generate
11155                                                             all instructions even
11156                                                             for OpenCL.*
11157      load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
11158                                                          local address space cannot
11159                                                          be used.*
11160
11161                                                          *Same as corresponding
11162                                                          load atomic acquire,
11163                                                          except must generate
11164                                                          all instructions even
11165                                                          for OpenCL.*
11166
11167      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
11168                                - system       - generic     vmcnt(0)
11169
11170                                                            - If TgSplit execution mode,
11171                                                              omit lgkmcnt(0).
11172                                                            - Could be split into
11173                                                              separate s_waitcnt
11174                                                              vmcnt(0)
11175                                                              and s_waitcnt
11176                                                              lgkmcnt(0) to allow
11177                                                              them to be
11178                                                              independently moved
11179                                                              according to the
11180                                                              following rules.
11181                                                            - s_waitcnt lgkmcnt(0)
11182                                                              must happen after
11183                                                              preceding
11184                                                              global/generic load
11185                                                              atomic/store
11186                                                              atomic/atomicrmw
11187                                                              with memory
11188                                                              ordering of seq_cst
11189                                                              and with equal or
11190                                                              wider sync scope.
11191                                                              (Note that seq_cst
11192                                                              fences have their
11193                                                              own s_waitcnt
11194                                                              lgkmcnt(0) and so do
11195                                                              not need to be
11196                                                              considered.)
11197                                                            - s_waitcnt vmcnt(0)
11198                                                              must happen after
11199                                                              preceding
11200                                                              global/generic load
11201                                                              atomic/store
11202                                                              atomic/atomicrmw
11203                                                              with memory
11204                                                              ordering of seq_cst
11205                                                              and with equal or
11206                                                              wider sync scope.
11207                                                              (Note that seq_cst
11208                                                              fences have their
11209                                                              own s_waitcnt
11210                                                              vmcnt(0) and so do
11211                                                              not need to be
11212                                                              considered.)
11213                                                            - Ensures any
11214                                                              preceding
11215                                                              sequentially
11216                                                              consistent global
11217                                                              memory instructions
11218                                                              have completed
11219                                                              before executing
11220                                                              this sequentially
11221                                                              consistent
11222                                                              instruction. This
11223                                                              prevents reordering
11224                                                              a seq_cst store
11225                                                              followed by a
11226                                                              seq_cst load. (Note
11227                                                              that seq_cst is
11228                                                              stronger than
11229                                                              acquire/release as
11230                                                              the reordering of
11231                                                              load acquire
11232                                                              followed by a store
11233                                                              release is
11234                                                              prevented by the
11235                                                              s_waitcnt of
11236                                                              the release, but
11237                                                              there is nothing
11238                                                              preventing a store
11239                                                              release followed by
11240                                                              load acquire from
11241                                                              completing out of
11242                                                              order. The s_waitcnt
11243                                                              could be placed after
11244                                                              seq_store or before
11245                                                              the seq_load. We
11246                                                              choose the load to
11247                                                              make the s_waitcnt be
11248                                                              as late as possible
11249                                                              so that the store
11250                                                              may have already
11251                                                              completed.)
11252
11253                                                          2. *Following
11254                                                             instructions same as
11255                                                             corresponding load
11256                                                             atomic acquire,
11257                                                             except must generate
11258                                                             all instructions even
11259                                                             for OpenCL.*
11260      store atomic seq_cst      - singlethread - global   *Same as corresponding
11261                                - wavefront    - local    store atomic release,
11262                                - workgroup    - generic  except must generate
11263                                - agent                   all instructions even
11264                                - system                  for OpenCL.*
11265      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
11266                                - wavefront    - local    atomicrmw acq_rel,
11267                                - workgroup    - generic  except must generate
11268                                - agent                   all instructions even
11269                                - system                  for OpenCL.*
11270      fence        seq_cst      - singlethread *none*     *Same as corresponding
11271                                - wavefront               fence acq_rel,
11272                                - workgroup               except must generate
11273                                - agent                   all instructions even
11274                                - system                  for OpenCL.*
11275      ============ ============ ============== ========== ================================
11276
11277 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11278
11279 Memory Model GFX10-GFX11
11280 ++++++++++++++++++++++++
11281
11282 For GFX10-GFX11:
11283
11284 * Each agent has multiple shader arrays (SA).
11285 * Each SA has multiple work-group processors (WGP).
11286 * Each WGP has multiple compute units (CU).
11287 * Each CU has multiple SIMDs that execute wavefronts.
11288 * The wavefronts for a single work-group are executed in the same
11289   WGP. In CU wavefront execution mode the wavefronts may be executed by
11290   different SIMDs in the same CU. In WGP wavefront execution mode the
11291   wavefronts may be executed by different SIMDs in different CUs in the same
11292   WGP.
11293 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
11294   executing on it.
11295 * All LDS operations of a WGP are performed as wavefront wide operations in a
11296   global order and involve no caching. Completion is reported to a wavefront in
11297   execution order.
11298 * The LDS memory has multiple request queues shared by the SIMDs of a
11299   WGP. Therefore, the LDS operations performed by different wavefronts of a
11300   work-group can be reordered relative to each other, which can result in
11301   reordering the visibility of vector memory operations with respect to LDS
11302   operations of other wavefronts in the same work-group. A ``s_waitcnt
11303   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11304   vector memory operations between wavefronts of a work-group, but not between
11305   operations performed by the same wavefront.
11306 * The vector memory operations are performed as wavefront wide operations.
11307   Completion of load/store/sample operations is reported to a wavefront in
11308   execution order of other load/store/sample operations performed by that
11309   wavefront.
11310 * The vector memory operations access a vector L0 cache. There is a single L0
11311   cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11312   special action is required for coherence between the lanes of a single
11313   wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11314   wavefronts executing in the same work-group as they may be executing on SIMDs
11315   of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11316   required for coherence between wavefronts executing in different work-groups
11317   as they may be executing on different WGPs.
11318 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
11319   on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11320   operations are used in a restricted way so do not impact the memory model. See
11321   :ref:`amdgpu-amdhsa-memory-spaces`.
11322 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11323   the same SA. Therefore, no special action is required for coherence between
11324   the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11325   required for coherence between wavefronts executing in different work-groups
11326   as they may be executing on different SAs that access different L1s.
11327 * The L1 caches have independent quadrants to service disjoint ranges of virtual
11328   addresses.
11329 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11330   vector and scalar memory operations performed by different wavefronts, whether
11331   executing in the same or different work-groups (which may be executing on
11332   different CUs accessing different L0s), can be reordered relative to each
11333   other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11334   synchronization between vector memory operations of different wavefronts. It
11335   ensures a previous vector memory operation has completed before executing a
11336   subsequent vector memory or LDS operation and so can be used to meet the
11337   requirements of acquire, release and sequential consistency.
11338 * The L1 caches use an L2 cache shared by all SAs on the same agent.
11339 * The L2 cache has independent channels to service disjoint ranges of virtual
11340   addresses.
11341 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11342   quadrant has a separate request queue per L2 channel. Therefore, the vector
11343   and scalar memory operations performed by wavefronts executing in different
11344   work-groups (which may be executing on different SAs) of an agent can be
11345   reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11346   required to ensure synchronization between vector memory operations of
11347   different SAs. It ensures a previous vector memory operation has completed
11348   before executing a subsequent vector memory operation and so can be used to
11349   meet the requirements of acquire, release and sequential consistency.
11350 * The L2 cache can be kept coherent with other agents on some targets, or ranges
11351   of virtual addresses can be set up to bypass it to ensure system coherence.
11352 * On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
11353   The MALL cache is fully coherent with GPU memory and has no impact on system
11354   coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
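
The cache hierarchy described above can be condensed into a small model. The
following Python sketch is illustrative only and is not part of LLVM: the
``Placement`` type and the ``invalidates_needed`` function are hypothetical
names. It covers only which cache invalidates a reading wavefront would need
given where the writing wavefront executed. It does not model the ``s_waitcnt``
ordering requirements, and a compiler selects instructions from the
synchronization scope rather than from the actual placement, which is not known
statically.

.. code-block:: python

  from typing import List, NamedTuple


  class Placement(NamedTuple):
      """Hypothetical description of where a wavefront executes."""
      sa: int   # shader array index within the agent
      wgp: int  # work-group processor index within the SA
      cu: int   # compute unit index within the WGP


  def invalidates_needed(reader: Placement, writer: Placement) -> List[str]:
      """Return the invalidates the reader needs to avoid stale cached data."""
      if reader.sa != writer.sa:
          # Different SAs use different L1 caches (and different L0 caches).
          return ["buffer_gl0_inv", "buffer_gl1_inv"]
      if (reader.wgp, reader.cu) != (writer.wgp, writer.cu):
          # The SA shares a single L1, but each CU has its own vector L0.
          return ["buffer_gl0_inv"]
      # Same CU: every SIMD of the CU accesses the same L0.
      return []

For example, ``invalidates_needed(Placement(0, 0, 0), Placement(0, 0, 1))``
returns only ``buffer_gl0_inv``, since both CUs sit behind the same L1.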
11355
11356 Scalar memory operations are only used to access memory that is proven to not
11357 change during the execution of the kernel dispatch. This includes constant
11358 address space and global address space for program scope ``const`` variables.
11359 Therefore, the kernel machine code does not have to maintain the scalar cache to
11360 ensure it is coherent with the vector caches. The scalar and vector caches are
11361 invalidated between kernel dispatches by CP since constant address space data
11362 may change between kernel dispatch executions. See
11363 :ref:`amdgpu-amdhsa-memory-spaces`.
11364
11365 The one exception is if scalar writes are used to spill SGPR registers. In this
11366 case the AMDGPU backend ensures the memory location used to spill is never
11367 accessed by vector memory operations at the same time. If scalar writes are used
11368 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11369 return since the locations may be used for vector memory instructions by a
11370 future wavefront that uses the same scratch area, or a function call that
11371 creates a frame at the same address, respectively. There is no need for a
11372 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
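
The placement rule for ``s_dcache_wb`` can be sketched as a pass over a list of
instruction mnemonics. This is a minimal illustration and not the AMDGPU
backend's implementation: ``insert_scalar_writeback`` is a hypothetical helper,
and representing a function return by the mnemonic ``s_setpc_b64`` is an
assumption made for the example.

.. code-block:: python

  from typing import List

  # Assumption for illustration: a function return is represented by the
  # mnemonic "s_setpc_b64"; the text above only says "function return".
  EXIT_MNEMONICS = ("s_endpgm", "s_setpc_b64")


  def insert_scalar_writeback(instrs: List[str],
                              uses_scalar_spill_writes: bool) -> List[str]:
      """Insert s_dcache_wb before each program end or function return."""
      if not uses_scalar_spill_writes:
          # No scalar writes were used for spilling, so nothing to write back.
          return list(instrs)
      out = []
      for instr in instrs:
          mnemonic = instr.split()[0] if instr.strip() else ""
          if mnemonic in EXIT_MNEMONICS:
              out.append("s_dcache_wb")
          out.append(instr)
      return out

No ``s_dcache_inv`` is emitted by the sketch, matching the write-before-read
argument above.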
11373
11374 For kernarg backing memory:
11375
11376 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11377 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11378   needing to invalidate the L2 cache.
11379 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11380   so the L2 cache will be coherent with the CPU and other agents.
11381
11382 Scratch backing memory (which is used for the private address space) is accessed
11383 with MTYPE NC (non-coherent). Since the private address space is only accessed
11384 by a single thread, and is always write-before-read, there is never a need to
11385 invalidate these entries from the L0 or L1 caches.
11386
11387 Wavefronts are executed in native mode with in-order reporting of loads and
11388 sample instructions. In this mode vmcnt reports completion of load, atomic with
11389 return and sample instructions in order, and the vscnt reports the completion of
11390 store and atomic without return in order. See ``MEM_ORDERED`` field in
11391 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
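
The split between the two counters can be written as a simple classification.
The sketch below is only a restatement of the paragraph above in Python;
``completion_counter`` and its operation names are hypothetical and do not
correspond to an LLVM API.

.. code-block:: python

  def completion_counter(op: str, returns_value: bool = False) -> str:
      """Return the counter that reports completion of a vector memory op."""
      if op in ("load", "sample"):
          return "vmcnt"
      if op == "store":
          return "vscnt"
      if op == "atomic":
          # Atomics with return are counted like loads, those without
          # return are counted like stores.
          return "vmcnt" if returns_value else "vscnt"
      raise ValueError(f"unknown vector memory operation: {op!r}")

This is why a release must wait on both counters, while waiting for a preceding
load alone only needs ``vmcnt``.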
11392
11393 Wavefronts can be executed in WGP or CU wavefront execution mode:
11394
11395 * In WGP wavefront execution mode the wavefronts of a work-group are executed
11396   on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
11397   CU L0 caches is required for work-group synchronization. Also, accesses to L1
11398   at work-group scope need to be explicitly ordered as the accesses from
11399   different CUs are not ordered.
11400 * In CU wavefront execution mode the wavefronts of a work-group are executed on
11401   the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
11402   the work-group use the same L0, which in turn ensures L1 accesses are
11403   ordered, and so explicit management of the caches is not required for
11404   work-group synchronization.
11405
11406 See ``WGP_MODE`` field in
11407 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
11408 :ref:`amdgpu-target-features`.
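
As an illustration of how the execution mode changes the generated code, the
following Python sketch reproduces the ``load atomic acquire`` row for
work-group scope and the global address space from the GFX10-GFX11 code
sequence table. The function name is hypothetical, and the strings use the
table's shorthand rather than complete assembler syntax.

.. code-block:: python

  from typing import List


  def acquire_load_workgroup_global(cu_mode: bool) -> List[str]:
      """Code sequence for a workgroup-scope acquire load from global memory."""
      if cu_mode:
          # All wavefronts of the work-group share one L0, so the glc bit,
          # the wait, and the L0 invalidate are all omitted.
          return ["buffer/global_load"]
      return [
          "buffer/global_load glc=1",
          "s_waitcnt vmcnt(0)",   # must complete before the invalidate
          "buffer_gl0_inv",       # following loads will not see stale data
      ]

In CU wavefront execution mode the sequence collapses to the plain load, which
is the practical benefit of keeping a work-group on a single CU.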
11409
11410 The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11411 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
11412
11413   .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11414      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11415
11416      ============ ============ ============== ========== ================================
11417      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
11418                   Ordering     Sync Scope     Address    GFX10-GFX11
11419                                               Space
11420      ============ ============ ============== ========== ================================
11421      **Non-Atomic**
11422      ------------------------------------------------------------------------------------
11423      load         *none*       *none*         - global   - !volatile & !nontemporal
11424                                               - generic
11425                                               - private    1. buffer/global/flat_load
11426                                               - constant
11427                                                          - !volatile & nontemporal
11428
11429                                                            1. buffer/global/flat_load
11430                                                               slc=1 dlc=1
11431
11432                                                             - If GFX10, omit dlc=1.
11433
11434                                                          - volatile
11435
11436                                                            1. buffer/global/flat_load
11437                                                               glc=1 dlc=1
11438
11439                                                            2. s_waitcnt vmcnt(0)
11440
11441                                                             - Must happen before
11442                                                               any following volatile
11443                                                               global/generic
11444                                                               load/store.
11445                                                             - Ensures that
11446                                                               volatile
11447                                                               operations to
11448                                                               different
11449                                                               addresses will not
11450                                                               be reordered by
11451                                                               hardware.
11452
11453      load         *none*       *none*         - local    1. ds_load
11454      store        *none*       *none*         - global   - !volatile & !nontemporal
11455                                               - generic
11456                                               - private    1. buffer/global/flat_store
11457                                               - constant
11458                                                          - !volatile & nontemporal
11459
11460                                                            1. buffer/global/flat_store
11461                                                               glc=1 slc=1 dlc=1
11462
11463                                                             - If GFX10, omit dlc=1.
11464
11465                                                          - volatile
11466
11467                                                            1. buffer/global/flat_store
11468                                                               dlc=1
11469
11470                                                             - If GFX10, omit dlc=1.
11471
11472                                                            2. s_waitcnt vscnt(0)
11473
11474                                                             - Must happen before
11475                                                               any following volatile
11476                                                               global/generic
11477                                                               load/store.
11478                                                             - Ensures that
11479                                                               volatile
11480                                                               operations to
11481                                                               different
11482                                                               addresses will not
11483                                                               be reordered by
11484                                                               hardware.
11485
11486      store        *none*       *none*         - local    1. ds_store
11487      **Unordered Atomic**
11488      ------------------------------------------------------------------------------------
11489      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
11490      store atomic unordered    *any*          *any*      *Same as non-atomic*.
11491      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
11492      **Monotonic Atomic**
11493      ------------------------------------------------------------------------------------
11494      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
11495                                - wavefront    - generic
11496      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
11497                                               - generic     glc=1
11498
11499                                                            - If CU wavefront execution
11500                                                              mode, omit glc=1.
11501
11502      load atomic  monotonic    - singlethread - local    1. ds_load
11503                                - wavefront
11504                                - workgroup
11505      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
11506                                - system       - generic     glc=1 dlc=1
11507
11508                                                            - If GFX11, omit dlc=1.
11509
11510      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
11511                                - wavefront    - generic
11512                                - workgroup
11513                                - agent
11514                                - system
11515      store atomic monotonic    - singlethread - local    1. ds_store
11516                                - wavefront
11517                                - workgroup
11518      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
11519                                - wavefront    - generic
11520                                - workgroup
11521                                - agent
11522                                - system
11523      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
11524                                - wavefront
11525                                - workgroup
11526      **Acquire Atomic**
11527      ------------------------------------------------------------------------------------
11528      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
11529                                - wavefront    - local
11530                                               - generic
11531      load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
11532
11533                                                            - If CU wavefront execution
11534                                                              mode, omit glc=1.
11535
11536                                                          2. s_waitcnt vmcnt(0)
11537
11538                                                            - If CU wavefront execution
11539                                                              mode, omit.
11540                                                            - Must happen before
11541                                                              the following buffer_gl0_inv
11542                                                              and before any following
11543                                                              global/generic
11544                                                              load/load
11545                                                              atomic/store/store
11546                                                              atomic/atomicrmw.
11547
11548                                                          3. buffer_gl0_inv
11549
11550                                                            - If CU wavefront execution
11551                                                              mode, omit.
11552                                                            - Ensures that
11553                                                              following
11554                                                              loads will not see
11555                                                              stale data.
11556
11557      load atomic  acquire      - workgroup    - local    1. ds_load
11558                                                          2. s_waitcnt lgkmcnt(0)
11559
11560                                                            - If OpenCL, omit.
11561                                                            - Must happen before
11562                                                              the following buffer_gl0_inv
11563                                                              and before any following
11564                                                              global/generic load/load
11565                                                              atomic/store/store
11566                                                              atomic/atomicrmw.
11567                                                            - Ensures any
11568                                                              following global
11569                                                              data read is no
11570                                                              older than the local load
11571                                                              atomic value being
11572                                                              acquired.
11573
11574                                                          3. buffer_gl0_inv
11575
11576                                                            - If CU wavefront execution
11577                                                              mode, omit.
11578                                                            - If OpenCL, omit.
11579                                                            - Ensures that
11580                                                              following
11581                                                              loads will not see
11582                                                              stale data.
11583
11584      load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
11585
11586                                                            - If CU wavefront execution
11587                                                              mode, omit glc=1.
11588
11589                                                          2. s_waitcnt lgkmcnt(0) &
11590                                                             vmcnt(0)
11591
11592                                                            - If CU wavefront execution
11593                                                              mode, omit vmcnt(0).
11594                                                            - If OpenCL, omit
11595                                                              lgkmcnt(0).
11596                                                            - Must happen before
11597                                                              the following
11598                                                              buffer_gl0_inv and any
11599                                                              following global/generic
11600                                                              load/load
11601                                                              atomic/store/store
11602                                                              atomic/atomicrmw.
11603                                                            - Ensures any
11604                                                              following global
11605                                                              data read is no
11606                                                              older than a local load
11607                                                              atomic value being
11608                                                              acquired.
11609
11610                                                          3. buffer_gl0_inv
11611
11612                                                            - If CU wavefront execution
11613                                                              mode, omit.
11614                                                            - Ensures that
11615                                                              following
11616                                                              loads will not see
11617                                                              stale data.
11618
11619      load atomic  acquire      - agent        - global   1. buffer/global_load
11620                                - system                     glc=1 dlc=1
11621
11622                                                            - If GFX11, omit dlc=1.
11623
11624                                                          2. s_waitcnt vmcnt(0)
11625
11626                                                            - Must happen before
11627                                                              following
11628                                                              buffer_gl*_inv.
11629                                                            - Ensures the load
11630                                                              has completed
11631                                                              before invalidating
11632                                                              the caches.
11633
11634                                                          3. buffer_gl0_inv;
11635                                                             buffer_gl1_inv
11636
11637                                                            - Must happen before
11638                                                              any following
11639                                                              global/generic
11640                                                              load/load
11641                                                              atomic/atomicrmw.
11642                                                            - Ensures that
11643                                                              following
11644                                                              loads will not see
11645                                                              stale global data.
11646
11647      load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
11648                                - system
11649                                                            - If GFX11, omit dlc=1.
11650
11651                                                          2. s_waitcnt vmcnt(0) &
11652                                                             lgkmcnt(0)
11653
11654                                                            - If OpenCL, omit
11655                                                              lgkmcnt(0).
11656                                                            - Must happen before
11657                                                              following
11658                                                              buffer_gl*_inv.
11659                                                            - Ensures the flat_load
11660                                                              has completed
11661                                                              before invalidating
11662                                                              the caches.
11663
11664                                                          3. buffer_gl0_inv;
11665                                                             buffer_gl1_inv
11666
11667                                                            - Must happen before
11668                                                              any following
11669                                                              global/generic
11670                                                              load/load
11671                                                              atomic/atomicrmw.
11672                                                            - Ensures that
11673                                                              following loads
11674                                                              will not see stale
11675                                                              global data.
11676
11677      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
11678                                - wavefront    - local
11679                                               - generic
11680      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
11681                                                          2. s_waitcnt vm/vscnt(0)
11682
11683                                                            - If CU wavefront execution
11684                                                              mode, omit.
11685                                                            - Use vmcnt(0) if atomic with
11686                                                              return and vscnt(0) if
11687                                                              atomic with no-return.
11688                                                            - Must happen before
11689                                                              the following buffer_gl0_inv
11690                                                              and before any following
11691                                                              global/generic
11692                                                              load/load
11693                                                              atomic/store/store
11694                                                              atomic/atomicrmw.
11695
11696                                                          3. buffer_gl0_inv
11697
11698                                                            - If CU wavefront execution
11699                                                              mode, omit.
11700                                                            - Ensures that
11701                                                              following
11702                                                              loads will not see
11703                                                              stale data.
11704
11705      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
11706                                                          2. s_waitcnt lgkmcnt(0)
11707
11708                                                            - If OpenCL, omit.
11709                                                            - Must happen before
11710                                                              the following
11711                                                              buffer_gl0_inv.
11712                                                            - Ensures any
11713                                                              following global
11714                                                              data read is no
11715                                                              older than the local
11716                                                              atomicrmw value
11717                                                              being acquired.
11718
11719                                                          3. buffer_gl0_inv
11720
11721                                                            - If OpenCL, omit.
11722                                                            - Ensures that
11723                                                              following
11724                                                              loads will not see
11725                                                              stale data.
11726
11727      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
11728                                                          2. s_waitcnt lgkmcnt(0) &
11729                                                             vm/vscnt(0)
11730
11731                                                            - If CU wavefront execution
11732                                                              mode, omit vm/vscnt(0).
11733                                                            - If OpenCL, omit lgkmcnt(0).
11734                                                            - Use vmcnt(0) if atomic with
11735                                                              return and vscnt(0) if
11736                                                              atomic with no-return.
11737                                                            - Must happen before
11738                                                              the following
11739                                                              buffer_gl0_inv.
11740                                                            - Ensures any
11741                                                              following global
11742                                                              data read is no
11743                                                              older than a local
11744                                                              atomicrmw value
11745                                                              being acquired.
11746
11747                                                          3. buffer_gl0_inv
11748
11749                                                            - If CU wavefront execution
11750                                                              mode, omit.
11751                                                            - Ensures that
11752                                                              following
11753                                                              loads will not see
11754                                                              stale data.
11755
11756      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
11757                                - system                  2. s_waitcnt vm/vscnt(0)
11758
11759                                                            - Use vmcnt(0) if atomic with
11760                                                              return and vscnt(0) if
11761                                                              atomic with no-return.
11762                                                            - Must happen before
11763                                                              following
11764                                                              buffer_gl*_inv.
11765                                                            - Ensures the
11766                                                              atomicrmw has
11767                                                              completed before
11768                                                              invalidating the
11769                                                              caches.
11770
11771                                                          3. buffer_gl0_inv;
11772                                                             buffer_gl1_inv
11773
11774                                                            - Must happen before
11775                                                              any following
11776                                                              global/generic
11777                                                              load/load
11778                                                              atomic/atomicrmw.
11779                                                            - Ensures that
11780                                                              following loads
11781                                                              will not see stale
11782                                                              global data.
11783
11784      atomicrmw    acquire      - agent        - generic  1. flat_atomic
11785                                - system                  2. s_waitcnt vm/vscnt(0) &
11786                                                             lgkmcnt(0)
11787
11788                                                            - If OpenCL, omit
11789                                                              lgkmcnt(0).
11790                                                            - Use vmcnt(0) if atomic with
11791                                                              return and vscnt(0) if
11792                                                              atomic with no-return.
11793                                                            - Must happen before
11794                                                              following
11795                                                              buffer_gl*_inv.
11796                                                            - Ensures the
11797                                                              atomicrmw has
11798                                                              completed before
11799                                                              invalidating the
11800                                                              caches.
11801
11802                                                          3. buffer_gl0_inv;
11803                                                             buffer_gl1_inv
11804
11805                                                            - Must happen before
11806                                                              any following
11807                                                              global/generic
11808                                                              load/load
11809                                                              atomic/atomicrmw.
11810                                                            - Ensures that
11811                                                              following loads
11812                                                              will not see stale
11813                                                              global data.
11814
11815      fence        acquire      - singlethread *none*     *none*
11816                                - wavefront
11817      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
11818                                                             vmcnt(0) & vscnt(0)
11819
11820                                                            - If CU wavefront execution
11821                                                              mode, omit vmcnt(0) and
11822                                                              vscnt(0).
11823                                                            - If OpenCL and
11824                                                              address space is
11825                                                              not generic, omit
11826                                                              lgkmcnt(0).
11827                                                            - If OpenCL and
11828                                                              address space is
11829                                                              local, omit
11830                                                              vmcnt(0) and vscnt(0).
11831                                                            - However, since LLVM
11832                                                              currently has no
11833                                                              address space on
11834                                                              the fence, the waits
11835                                                              must conservatively
11836                                                              always be generated. If
11837                                                              fence had an
11838                                                              address space then
11839                                                              set to address
11840                                                              space of OpenCL
11841                                                              fence flag, or to
11842                                                              generic if both
11843                                                              local and global
11844                                                              flags are
11845                                                              specified.
11846                                                            - Could be split into
11847                                                              separate s_waitcnt
11848                                                              vmcnt(0), s_waitcnt
11849                                                              vscnt(0) and s_waitcnt
11850                                                              lgkmcnt(0) to allow
11851                                                              them to be
11852                                                              independently moved
11853                                                              according to the
11854                                                              following rules.
11855                                                            - s_waitcnt vmcnt(0)
11856                                                              must happen after
11857                                                              any preceding
11858                                                              global/generic load
11859                                                              atomic/
11860                                                              atomicrmw-with-return-value
11861                                                              with an equal or
11862                                                              wider sync scope
11863                                                              and memory ordering
11864                                                              stronger than
11865                                                              unordered (this is
11866                                                              termed the
11867                                                              fence-paired-atomic).
11868                                                            - s_waitcnt vscnt(0)
11869                                                              must happen after
11870                                                              any preceding
11871                                                              global/generic
11872                                                              atomicrmw-no-return-value
11873                                                              with an equal or
11874                                                              wider sync scope
11875                                                              and memory ordering
11876                                                              stronger than
11877                                                              unordered (this is
11878                                                              termed the
11879                                                              fence-paired-atomic).
11880                                                            - s_waitcnt lgkmcnt(0)
11881                                                              must happen after
11882                                                              any preceding
11883                                                              local/generic load
11884                                                              atomic/atomicrmw
11885                                                              with an equal or
11886                                                              wider sync scope
11887                                                              and memory ordering
11888                                                              stronger than
11889                                                              unordered (this is
11890                                                              termed the
11891                                                              fence-paired-atomic).
11892                                                            - Must happen before
11893                                                              the following
11894                                                              buffer_gl0_inv.
11895                                                            - Ensures that the
11896                                                              fence-paired-atomic
11897                                                              has completed
11898                                                              before invalidating
11899                                                              the
11900                                                              cache. Therefore
11901                                                              any following
11902                                                              locations read must
11903                                                              be no older than
11904                                                              the value read by
11905                                                              the
11906                                                              fence-paired-atomic.
11907
11908                                                          2. buffer_gl0_inv
11909
11910                                                            - If CU wavefront execution
11911                                                              mode, omit.
11912                                                            - Ensures that
11913                                                              following
11914                                                              loads will not see
11915                                                              stale data.
11916
11917      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
11918                                - system                     vmcnt(0) & vscnt(0)
11919
11920                                                            - If OpenCL and
11921                                                              address space is
11922                                                              not generic, omit
11923                                                              lgkmcnt(0).
11924                                                            - If OpenCL and
11925                                                              address space is
11926                                                              local, omit
11927                                                              vmcnt(0) and vscnt(0).
11928                                                            - However, since LLVM
11929                                                              currently has no
11930                                                              address space on
11931                                                              the fence, the waits
11932                                                              must conservatively
11933                                                              always be generated
11934                                                              (see comment for
11935                                                              previous fence).
11936                                                            - Could be split into
11937                                                              separate s_waitcnt
11938                                                              vmcnt(0), s_waitcnt
11939                                                              vscnt(0) and s_waitcnt
11940                                                              lgkmcnt(0) to allow
11941                                                              them to be
11942                                                              independently moved
11943                                                              according to the
11944                                                              following rules.
11945                                                            - s_waitcnt vmcnt(0)
11946                                                              must happen after
11947                                                              any preceding
11948                                                              global/generic load
11949                                                              atomic/
11950                                                              atomicrmw-with-return-value
11951                                                              with an equal or
11952                                                              wider sync scope
11953                                                              and memory ordering
11954                                                              stronger than
11955                                                              unordered (this is
11956                                                              termed the
11957                                                              fence-paired-atomic).
11958                                                            - s_waitcnt vscnt(0)
11959                                                              must happen after
11960                                                              any preceding
11961                                                              global/generic
11962                                                              atomicrmw-no-return-value
11963                                                              with an equal or
11964                                                              wider sync scope
11965                                                              and memory ordering
11966                                                              stronger than
11967                                                              unordered (this is
11968                                                              termed the
11969                                                              fence-paired-atomic).
11970                                                            - s_waitcnt lgkmcnt(0)
11971                                                              must happen after
11972                                                              any preceding
11973                                                              local/generic load
11974                                                              atomic/atomicrmw
11975                                                              with an equal or
11976                                                              wider sync scope
11977                                                              and memory ordering
11978                                                              stronger than
11979                                                              unordered (this is
11980                                                              termed the
11981                                                              fence-paired-atomic).
11982                                                            - Must happen before
11983                                                              the following
11984                                                              buffer_gl*_inv.
11985                                                            - Ensures that the
11986                                                              fence-paired-atomic
11987                                                              has completed
11988                                                              before invalidating
11989                                                              the
11990                                                              caches. Therefore
11991                                                              any following
11992                                                              locations read must
11993                                                              be no older than
11994                                                              the value read by
11995                                                              the
11996                                                              fence-paired-atomic.
11997
11998                                                          2. buffer_gl0_inv;
11999                                                             buffer_gl1_inv
12000
12001                                                            - Must happen before any
12002                                                              following global/generic
12003                                                              load/load
12004                                                              atomic/store/store
12005                                                              atomic/atomicrmw.
12006                                                            - Ensures that
12007                                                              following loads
12008                                                              will not see stale
12009                                                              global data.
12010
12011      **Release Atomic**
12012      ------------------------------------------------------------------------------------
12013      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
12014                                - wavefront    - local
12015                                               - generic
12016      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12017                                               - generic     vmcnt(0) & vscnt(0)
12018
12019                                                            - If CU wavefront execution
12020                                                              mode, omit vmcnt(0) and
12021                                                              vscnt(0).
12022                                                            - If OpenCL, omit
12023                                                              lgkmcnt(0).
12024                                                            - Could be split into
12025                                                              separate s_waitcnt
12026                                                              vmcnt(0), s_waitcnt
12027                                                              vscnt(0) and s_waitcnt
12028                                                              lgkmcnt(0) to allow
12029                                                              them to be
12030                                                              independently moved
12031                                                              according to the
12032                                                              following rules.
12033                                                            - s_waitcnt vmcnt(0)
12034                                                              must happen after
12035                                                              any preceding
12036                                                              global/generic load/load
12037                                                              atomic/
12038                                                              atomicrmw-with-return-value.
12039                                                            - s_waitcnt vscnt(0)
12040                                                              must happen after
12041                                                              any preceding
12042                                                              global/generic
12043                                                              store/store
12044                                                              atomic/
12045                                                              atomicrmw-no-return-value.
12046                                                            - s_waitcnt lgkmcnt(0)
12047                                                              must happen after
12048                                                              any preceding
12049                                                              local/generic
12050                                                              load/store/load
12051                                                              atomic/store
12052                                                              atomic/atomicrmw.
12053                                                            - Must happen before
12054                                                              the following
12055                                                              store.
12056                                                            - Ensures that all
12057                                                              memory operations
12058                                                              have
12059                                                              completed before
12060                                                              performing the
12061                                                              store that is being
12062                                                              released.
12063
12064                                                          2. buffer/global/flat_store
12065      store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12066
12067                                                            - If CU wavefront execution
12068                                                              mode, omit.
12069                                                            - If OpenCL, omit.
12070                                                            - Could be split into
12071                                                              separate s_waitcnt
12072                                                              vmcnt(0) and s_waitcnt
12073                                                              vscnt(0) to allow
12074                                                              them to be
12075                                                              independently moved
12076                                                              according to the
12077                                                              following rules.
12078                                                            - s_waitcnt vmcnt(0)
12079                                                              must happen after
12080                                                              any preceding
12081                                                              global/generic load/load
12082                                                              atomic/
12083                                                              atomicrmw-with-return-value.
12084                                                            - s_waitcnt vscnt(0)
12085                                                              must happen after
12086                                                              any preceding
12087                                                              global/generic
12088                                                              store/store atomic/
12089                                                              atomicrmw-no-return-value.
12090                                                            - Must happen before
12091                                                              the following
12092                                                              store.
12093                                                            - Ensures that all
12094                                                              global memory
12095                                                              operations have
12096                                                              completed before
12097                                                              performing the
12098                                                              store that is being
12099                                                              released.
12100
12101                                                          2. ds_store
12102      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12103                                - system       - generic     vmcnt(0) & vscnt(0)
12104
12105                                                            - If OpenCL and
12106                                                              address space is
12107                                                              not generic, omit
12108                                                              lgkmcnt(0).
12109                                                            - Could be split into
12110                                                              separate s_waitcnt
12111                                                              vmcnt(0), s_waitcnt vscnt(0)
12112                                                              and s_waitcnt
12113                                                              lgkmcnt(0) to allow
12114                                                              them to be
12115                                                              independently moved
12116                                                              according to the
12117                                                              following rules.
12118                                                            - s_waitcnt vmcnt(0)
12119                                                              must happen after
12120                                                              any preceding
12121                                                              global/generic
12122                                                              load/load
12123                                                              atomic/
12124                                                              atomicrmw-with-return-value.
12125                                                            - s_waitcnt vscnt(0)
12126                                                              must happen after
12127                                                              any preceding
12128                                                              global/generic
12129                                                              store/store atomic/
12130                                                              atomicrmw-no-return-value.
12131                                                            - s_waitcnt lgkmcnt(0)
12132                                                              must happen after
12133                                                              any preceding
12134                                                              local/generic
12135                                                              load/store/load
12136                                                              atomic/store
12137                                                              atomic/atomicrmw.
12138                                                            - Must happen before
12139                                                              the following
12140                                                              store.
12141                                                            - Ensures that all
12142                                                              memory operations
12143                                                              have
12144                                                              completed before
12145                                                              performing the
12146                                                              store that is being
12147                                                              released.
12148
12149                                                          2. buffer/global/flat_store
12150      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
12151                                - wavefront    - local
12152                                               - generic
12153      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12154                                               - generic     vmcnt(0) & vscnt(0)
12155
12156                                                            - If CU wavefront execution
12157                                                              mode, omit vmcnt(0) and
12158                                                              vscnt(0).
12159                                                            - If OpenCL, omit lgkmcnt(0).
12160                                                            - Could be split into
12161                                                              separate s_waitcnt
12162                                                              vmcnt(0), s_waitcnt
12163                                                              vscnt(0) and s_waitcnt
12164                                                              lgkmcnt(0) to allow
12165                                                              them to be
12166                                                              independently moved
12167                                                              according to the
12168                                                              following rules.
12169                                                            - s_waitcnt vmcnt(0)
12170                                                              must happen after
12171                                                              any preceding
12172                                                              global/generic load/load
12173                                                              atomic/
12174                                                              atomicrmw-with-return-value.
12175                                                            - s_waitcnt vscnt(0)
12176                                                              must happen after
12177                                                              any preceding
12178                                                              global/generic
12179                                                              store/store
12180                                                              atomic/
12181                                                              atomicrmw-no-return-value.
12182                                                            - s_waitcnt lgkmcnt(0)
12183                                                              must happen after
12184                                                              any preceding
12185                                                              local/generic
12186                                                              load/store/load
12187                                                              atomic/store
12188                                                              atomic/atomicrmw.
12189                                                            - Must happen before
12190                                                              the following
12191                                                              atomicrmw.
12192                                                            - Ensures that all
12193                                                              memory operations
12194                                                              have
12195                                                              completed before
12196                                                              performing the
12197                                                              atomicrmw that is
12198                                                              being released.
12199
12200                                                          2. buffer/global/flat_atomic
12201      atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12202
12203                                                            - If CU wavefront execution
12204                                                              mode, omit.
12205                                                            - If OpenCL, omit.
12206                                                            - Could be split into
12207                                                              separate s_waitcnt
12208                                                              vmcnt(0) and s_waitcnt
12209                                                              vscnt(0) to allow
12210                                                              them to be
12211                                                              independently moved
12212                                                              according to the
12213                                                              following rules.
12214                                                            - s_waitcnt vmcnt(0)
12215                                                              must happen after
12216                                                              any preceding
12217                                                              global/generic load/load
12218                                                              atomic/
12219                                                              atomicrmw-with-return-value.
12220                                                            - s_waitcnt vscnt(0)
12221                                                              must happen after
12222                                                              any preceding
12223                                                              global/generic
12224                                                              store/store atomic/
12225                                                              atomicrmw-no-return-value.
12226                                                            - Must happen before
12227                                                              the following
12228                                                              atomicrmw.
12229                                                            - Ensures that all
12230                                                              global memory
12231                                                              operations have
12232                                                              completed before
12233                                                              performing the
12234                                                              atomicrmw that is
12235                                                              being released.
12236
12237                                                          2. ds_atomic
12238      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12239                                - system       - generic     vmcnt(0) & vscnt(0)
12240
12241                                                            - If OpenCL, omit
12242                                                              lgkmcnt(0).
12243                                                            - Could be split into
12244                                                              separate s_waitcnt
12245                                                              vmcnt(0), s_waitcnt
12246                                                              vscnt(0) and s_waitcnt
12247                                                              lgkmcnt(0) to allow
12248                                                              them to be
12249                                                              independently moved
12250                                                              according to the
12251                                                              following rules.
12252                                                            - s_waitcnt vmcnt(0)
12253                                                              must happen after
12254                                                              any preceding
12255                                                              global/generic
12256                                                              load/load atomic/
12257                                                              atomicrmw-with-return-value.
12258                                                            - s_waitcnt vscnt(0)
12259                                                              must happen after
12260                                                              any preceding
12261                                                              global/generic
12262                                                              store/store atomic/
12263                                                              atomicrmw-no-return-value.
12264                                                            - s_waitcnt lgkmcnt(0)
12265                                                              must happen after
12266                                                              any preceding
12267                                                              local/generic
12268                                                              load/store/load
12269                                                              atomic/store
12270                                                              atomic/atomicrmw.
12271                                                            - Must happen before
12272                                                              the following
12273                                                              atomicrmw.
12274                                                            - Ensures that all
12275                                                              memory operations
12276                                                              to global and local
12277                                                              have completed
12278                                                              before performing
12279                                                              the atomicrmw that
12280                                                              is being released.
12281
12282                                                          2. buffer/global/flat_atomic
12283      fence        release      - singlethread *none*     *none*
12284                                - wavefront
12285      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12286                                                             vmcnt(0) & vscnt(0)
12287
12288                                                            - If CU wavefront execution
12289                                                              mode, omit vmcnt(0) and
12290                                                              vscnt(0).
12291                                                            - If OpenCL and
12292                                                              address space is
12293                                                              not generic, omit
12294                                                              lgkmcnt(0).
12295                                                            - If OpenCL and
12296                                                              address space is
12297                                                              local, omit
12298                                                              vmcnt(0) and vscnt(0).
12299                                                            - However, since LLVM
12300                                                              currently has no
12301                                                              address space on
12302                                                              the fence, the waits
12303                                                              must conservatively
12304                                                              always be generated. If
12305                                                              the fence had an
12306                                                              address space, it would
12307                                                              be set to the address
12308                                                              space of the OpenCL
12309                                                              fence flag, or to
12310                                                              generic if both
12311                                                              local and global
12312                                                              flags are
12313                                                              specified.
12314                                                            - Could be split into
12315                                                              separate s_waitcnt
12316                                                              vmcnt(0), s_waitcnt
12317                                                              vscnt(0) and s_waitcnt
12318                                                              lgkmcnt(0) to allow
12319                                                              them to be
12320                                                              independently moved
12321                                                              according to the
12322                                                              following rules.
12323                                                            - s_waitcnt vmcnt(0)
12324                                                              must happen after
12325                                                              any preceding
12326                                                              global/generic
12327                                                              load/load
12328                                                              atomic/
12329                                                              atomicrmw-with-return-value.
12330                                                            - s_waitcnt vscnt(0)
12331                                                              must happen after
12332                                                              any preceding
12333                                                              global/generic
12334                                                              store/store atomic/
12335                                                              atomicrmw-no-return-value.
12336                                                            - s_waitcnt lgkmcnt(0)
12337                                                              must happen after
12338                                                              any preceding
12339                                                              local/generic
12340                                                              load/store/load
12341                                                              atomic/store atomic/
12342                                                              atomicrmw.
12343                                                            - Must happen before
12344                                                              any following store
12345                                                              atomic/atomicrmw
12346                                                              with an equal or
12347                                                              wider sync scope
12348                                                              and memory ordering
12349                                                              stronger than
12350                                                              unordered (this is
12351                                                              termed the
12352                                                              fence-paired-atomic).
12353                                                            - Ensures that all
12354                                                              memory operations
12355                                                              have
12356                                                              completed before
12357                                                              performing the
12358                                                              following
12359                                                              fence-paired-atomic.
12360
12361      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12362                                - system                     vmcnt(0) & vscnt(0)
12363
12364                                                            - If OpenCL and
12365                                                              address space is
12366                                                              not generic, omit
12367                                                              lgkmcnt(0).
12368                                                            - If OpenCL and
12369                                                              address space is
12370                                                              local, omit
12371                                                              vmcnt(0) and vscnt(0).
12372                                                            - However, since LLVM
12373                                                              currently has no
12374                                                              address space on
12375                                                              the fence, the waits
12376                                                              must conservatively
12377                                                              always be generated. If
12378                                                              the fence had an
12379                                                              address space, it would
12380                                                              be set to the address
12381                                                              space of the OpenCL
12382                                                              fence flag, or to
12383                                                              generic if both
12384                                                              local and global
12385                                                              flags are
12386                                                              specified.
12387                                                            - Could be split into
12388                                                              separate s_waitcnt
12389                                                              vmcnt(0), s_waitcnt
12390                                                              vscnt(0) and s_waitcnt
12391                                                              lgkmcnt(0) to allow
12392                                                              them to be
12393                                                              independently moved
12394                                                              according to the
12395                                                              following rules.
12396                                                            - s_waitcnt vmcnt(0)
12397                                                              must happen after
12398                                                              any preceding
12399                                                              global/generic
12400                                                              load/load atomic/
12401                                                              atomicrmw-with-return-value.
12402                                                            - s_waitcnt vscnt(0)
12403                                                              must happen after
12404                                                              any preceding
12405                                                              global/generic
12406                                                              store/store atomic/
12407                                                              atomicrmw-no-return-value.
12408                                                            - s_waitcnt lgkmcnt(0)
12409                                                              must happen after
12410                                                              any preceding
12411                                                              local/generic
12412                                                              load/store/load
12413                                                              atomic/store
12414                                                              atomic/atomicrmw.
12415                                                            - Must happen before
12416                                                              any following store
12417                                                              atomic/atomicrmw
12418                                                              with an equal or
12419                                                              wider sync scope
12420                                                              and memory ordering
12421                                                              stronger than
12422                                                              unordered (this is
12423                                                              termed the
12424                                                              fence-paired-atomic).
12425                                                            - Ensures that all
12426                                                              memory operations
12427                                                              have
12428                                                              completed before
12429                                                              performing the
12430                                                              following
12431                                                              fence-paired-atomic.
12432
12433      **Acquire-Release Atomic**
12434      ------------------------------------------------------------------------------------
12435      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
12436                                - wavefront    - local
12437                                               - generic
12438      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12439                                                             vmcnt(0) & vscnt(0)
12440
12441                                                            - If CU wavefront execution
12442                                                              mode, omit vmcnt(0) and
12443                                                              vscnt(0).
12444                                                            - If OpenCL, omit
12445                                                              lgkmcnt(0).
12446                                                            - Must happen after
12447                                                              any preceding
12448                                                              local/generic
12449                                                              load/store/load
12450                                                              atomic/store
12451                                                              atomic/atomicrmw.
12452                                                            - Could be split into
12453                                                              separate s_waitcnt
12454                                                              vmcnt(0), s_waitcnt
12455                                                              vscnt(0), and s_waitcnt
12456                                                              lgkmcnt(0) to allow
12457                                                              them to be
12458                                                              independently moved
12459                                                              according to the
12460                                                              following rules.
12461                                                            - s_waitcnt vmcnt(0)
12462                                                              must happen after
12463                                                              any preceding
12464                                                              global/generic load/load
12465                                                              atomic/
12466                                                              atomicrmw-with-return-value.
12467                                                            - s_waitcnt vscnt(0)
12468                                                              must happen after
12469                                                              any preceding
12470                                                              global/generic
12471                                                              store/store
12472                                                              atomic/
12473                                                              atomicrmw-no-return-value.
12474                                                            - s_waitcnt lgkmcnt(0)
12475                                                              must happen after
12476                                                              any preceding
12477                                                              local/generic
12478                                                              load/store/load
12479                                                              atomic/store
12480                                                              atomic/atomicrmw.
12481                                                            - Must happen before
12482                                                              the following
12483                                                              atomicrmw.
12484                                                            - Ensures that all
12485                                                              memory operations
12486                                                              have
12487                                                              completed before
12488                                                              performing the
12489                                                              atomicrmw that is
12490                                                              being released.
12491
12492                                                          2. buffer/global_atomic
12493                                                          3. s_waitcnt vm/vscnt(0)
12494
12495                                                            - If CU wavefront execution
12496                                                              mode, omit.
12497                                                            - Use vmcnt(0) if atomic with
12498                                                              return and vscnt(0) if
12499                                                              atomic with no-return.
12500                                                            - Must happen before
12501                                                              the following
12502                                                              buffer_gl0_inv.
12503                                                            - Ensures any
12504                                                              following global
12505                                                              data read is no
12506                                                              older than the
12507                                                              atomicrmw value
12508                                                              being acquired.
12509
12510                                                          4. buffer_gl0_inv
12511
12512                                                            - If CU wavefront execution
12513                                                              mode, omit.
12514                                                            - Ensures that
12515                                                              following
12516                                                              loads will not see
12517                                                              stale data.
12518
12519      atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12520
12521                                                            - If CU wavefront execution
12522                                                              mode, omit.
12523                                                            - If OpenCL, omit.
12524                                                            - Could be split into
12525                                                              separate s_waitcnt
12526                                                              vmcnt(0) and s_waitcnt
12527                                                              vscnt(0) to allow
12528                                                              them to be
12529                                                              independently moved
12530                                                              according to the
12531                                                              following rules.
12532                                                            - s_waitcnt vmcnt(0)
12533                                                              must happen after
12534                                                              any preceding
12535                                                              global/generic load/load
12536                                                              atomic/
12537                                                              atomicrmw-with-return-value.
12538                                                            - s_waitcnt vscnt(0)
12539                                                              must happen after
12540                                                              any preceding
12541                                                              global/generic
12542                                                              store/store atomic/
12543                                                              atomicrmw-no-return-value.
12544                                                            - Must happen before
12545                                                              the following
12546                                                              atomicrmw.
12547                                                            - Ensures that all
12548                                                              global memory
12549                                                              operations have
12550                                                              completed before
12551                                                              performing the
12552                                                              atomicrmw that is being
12553                                                              released.
12554
12555                                                          2. ds_atomic
12556                                                          3. s_waitcnt lgkmcnt(0)
12557
12558                                                            - If OpenCL, omit.
12559                                                            - Must happen before
12560                                                              the following
12561                                                              buffer_gl0_inv.
12562                                                            - Ensures any
12563                                                              following global
12564                                                              data read is no
12565                                                              older than the local
12566                                                              atomicrmw value being
12567                                                              acquired.
12568
12569                                                          4. buffer_gl0_inv
12570
12571                                                            - If CU wavefront execution
12572                                                              mode, omit.
12573                                                            - If OpenCL, omit.
12574                                                            - Ensures that
12575                                                              following
12576                                                              loads will not see
12577                                                              stale data.
12578
12579      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
12580                                                             vmcnt(0) & vscnt(0)
12581
12582                                                            - If CU wavefront execution
12583                                                              mode, omit vmcnt(0) and
12584                                                              vscnt(0).
12585                                                            - If OpenCL, omit lgkmcnt(0).
12586                                                            - Could be split into
12587                                                              separate s_waitcnt
12588                                                              vmcnt(0), s_waitcnt
12589                                                              vscnt(0) and s_waitcnt
12590                                                              lgkmcnt(0) to allow
12591                                                              them to be
12592                                                              independently moved
12593                                                              according to the
12594                                                              following rules.
12595                                                            - s_waitcnt vmcnt(0)
12596                                                              must happen after
12597                                                              any preceding
12598                                                              global/generic load/load
12599                                                              atomic/
12600                                                              atomicrmw-with-return-value.
12601                                                            - s_waitcnt vscnt(0)
12602                                                              must happen after
12603                                                              any preceding
12604                                                              global/generic
12605                                                              store/store
12606                                                              atomic/
12607                                                              atomicrmw-no-return-value.
12608                                                            - s_waitcnt lgkmcnt(0)
12609                                                              must happen after
12610                                                              any preceding
12611                                                              local/generic
12612                                                              load/store/load
12613                                                              atomic/store
12614                                                              atomic/atomicrmw.
12615                                                            - Must happen before
12616                                                              the following
12617                                                              atomicrmw.
12618                                                            - Ensures that all
12619                                                              memory operations
12620                                                              have
12621                                                              completed before
12622                                                              performing the
12623                                                              atomicrmw that is
12624                                                              being released.
12625
12626                                                          2. flat_atomic
12627                                                          3. s_waitcnt lgkmcnt(0) &
12628                                                             vmcnt(0) & vscnt(0)
12629
12630                                                            - If CU wavefront execution
12631                                                              mode, omit vmcnt(0) and
12632                                                              vscnt(0).
12633                                                            - If OpenCL, omit lgkmcnt(0).
12634                                                            - Must happen before
12635                                                              the following
12636                                                              buffer_gl0_inv.
12637                                                            - Ensures any
12638                                                              following global
12639                                                              data read is no
12640                                                              older than the
12641                                                              atomicrmw value being
12642                                                              acquired.
12643
12644                                                          4. buffer_gl0_inv
12645
12646                                                            - If CU wavefront execution
12647                                                              mode, omit.
12648                                                            - Ensures that
12649                                                              following
12650                                                              loads will not see
12651                                                              stale data.
12652
12653      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12654                                - system                     vmcnt(0) & vscnt(0)
12655
12656                                                            - If OpenCL, omit
12657                                                              lgkmcnt(0).
12658                                                            - Could be split into
12659                                                              separate s_waitcnt
12660                                                              vmcnt(0), s_waitcnt
12661                                                              vscnt(0) and s_waitcnt
12662                                                              lgkmcnt(0) to allow
12663                                                              them to be
12664                                                              independently moved
12665                                                              according to the
12666                                                              following rules.
12667                                                            - s_waitcnt vmcnt(0)
12668                                                              must happen after
12669                                                              any preceding
12670                                                              global/generic
12671                                                              load/load atomic/
12672                                                              atomicrmw-with-return-value.
12673                                                            - s_waitcnt vscnt(0)
12674                                                              must happen after
12675                                                              any preceding
12676                                                              global/generic
12677                                                              store/store atomic/
12678                                                              atomicrmw-no-return-value.
12679                                                            - s_waitcnt lgkmcnt(0)
12680                                                              must happen after
12681                                                              any preceding
12682                                                              local/generic
12683                                                              load/store/load
12684                                                              atomic/store
12685                                                              atomic/atomicrmw.
12686                                                            - Must happen before
12687                                                              the following
12688                                                              atomicrmw.
12689                                                            - Ensures that all
12690                                                              memory operations
12691                                                              to global have
12692                                                              completed before
12693                                                              performing the
12694                                                              atomicrmw that is
12695                                                              being released.
12696
12697                                                          2. buffer/global_atomic
12698                                                          3. s_waitcnt vm/vscnt(0)
12699
12700                                                            - Use vmcnt(0) if atomic with
12701                                                              return and vscnt(0) if
12702                                                              atomic with no-return.
12703                                                            - Must happen before
12704                                                              following
12705                                                              buffer_gl*_inv.
12706                                                            - Ensures the
12707                                                              atomicrmw has
12708                                                              completed before
12709                                                              invalidating the
12710                                                              caches.
12711
12712                                                          4. buffer_gl0_inv;
12713                                                             buffer_gl1_inv
12714
12715                                                            - Must happen before
12716                                                              any following
12717                                                              global/generic
12718                                                              load/load
12719                                                              atomic/atomicrmw.
12720                                                            - Ensures that
12721                                                              following loads
12722                                                              will not see stale
12723                                                              global data.
12724
12725      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
12726                                - system                     vmcnt(0) & vscnt(0)
12727
12728                                                            - If OpenCL, omit
12729                                                              lgkmcnt(0).
12730                                                            - Could be split into
12731                                                              separate s_waitcnt
12732                                                              vmcnt(0), s_waitcnt
12733                                                              vscnt(0), and s_waitcnt
12734                                                              lgkmcnt(0) to allow
12735                                                              them to be
12736                                                              independently moved
12737                                                              according to the
12738                                                              following rules.
12739                                                            - s_waitcnt vmcnt(0)
12740                                                              must happen after
12741                                                              any preceding
12742                                                              global/generic
12743                                                              load/load atomic/
12744                                                              atomicrmw-with-return-value.
12745                                                            - s_waitcnt vscnt(0)
12746                                                              must happen after
12747                                                              any preceding
12748                                                              global/generic
12749                                                              store/store atomic/
12750                                                              atomicrmw-no-return-value.
12751                                                            - s_waitcnt lgkmcnt(0)
12752                                                              must happen after
12753                                                              any preceding
12754                                                              local/generic
12755                                                              load/store/load
12756                                                              atomic/store
12757                                                              atomic/atomicrmw.
12758                                                            - Must happen before
12759                                                              the following
12760                                                              atomicrmw.
12761                                                            - Ensures that all
12762                                                              memory operations
12763                                                              have
12764                                                              completed before
12765                                                              performing the
12766                                                              atomicrmw that is
12767                                                              being released.
12768
12769                                                          2. flat_atomic
12770                                                          3. s_waitcnt vm/vscnt(0) &
12771                                                             lgkmcnt(0)
12772
12773                                                            - If OpenCL, omit
12774                                                              lgkmcnt(0).
12775                                                            - Use vmcnt(0) if atomic with
12776                                                              return and vscnt(0) if
12777                                                              atomic with no-return.
12778                                                            - Must happen before
12779                                                              following
12780                                                              buffer_gl*_inv.
12781                                                            - Ensures the
12782                                                              atomicrmw has
12783                                                              completed before
12784                                                              invalidating the
12785                                                              caches.
12786
12787                                                          4. buffer_gl0_inv;
12788                                                             buffer_gl1_inv
12789
12790                                                            - Must happen before
12791                                                              any following
12792                                                              global/generic
12793                                                              load/load
12794                                                              atomic/atomicrmw.
12795                                                            - Ensures that
12796                                                              following loads
12797                                                              will not see stale
12798                                                              global data.
12799
12800      fence        acq_rel      - singlethread *none*     *none*
12801                                - wavefront
12802      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12803                                                             vmcnt(0) & vscnt(0)
12804
12805                                                            - If CU wavefront execution
12806                                                              mode, omit vmcnt(0) and
12807                                                              vscnt(0).
12808                                                            - If OpenCL and
12809                                                              address space is
12810                                                              not generic, omit
12811                                                              lgkmcnt(0).
12812                                                            - If OpenCL and
12813                                                              address space is
12814                                                              local, omit
12815                                                              vmcnt(0) and vscnt(0).
12816                                                            - However,
12817                                                              since LLVM
12818                                                              currently has no
12819                                                              address space on
12820                                                              the fence, the full
12821                                                              s_waitcnt must always
12822                                                              conservatively be
12823                                                              generated (see comment
12824                                                              for previous fence).
12825                                                            - Could be split into
12826                                                              separate s_waitcnt
12827                                                              vmcnt(0), s_waitcnt
12828                                                              vscnt(0) and s_waitcnt
12829                                                              lgkmcnt(0) to allow
12830                                                              them to be
12831                                                              independently moved
12832                                                              according to the
12833                                                              following rules.
12834                                                            - s_waitcnt vmcnt(0)
12835                                                              must happen after
12836                                                              any preceding
12837                                                              global/generic
12838                                                              load/load
12839                                                              atomic/
12840                                                              atomicrmw-with-return-value.
12841                                                            - s_waitcnt vscnt(0)
12842                                                              must happen after
12843                                                              any preceding
12844                                                              global/generic
12845                                                              store/store atomic/
12846                                                              atomicrmw-no-return-value.
12847                                                            - s_waitcnt lgkmcnt(0)
12848                                                              must happen after
12849                                                              any preceding
12850                                                              local/generic
12851                                                              load/store/load
12852                                                              atomic/store atomic/
12853                                                              atomicrmw.
12854                                                            - Must happen before
12855                                                              any following
12856                                                              global/generic
12857                                                              load/load
12858                                                              atomic/store/store
12859                                                              atomic/atomicrmw.
12860                                                            - Ensures that all
12861                                                              memory operations
12862                                                              have
12863                                                              completed before
12864                                                              performing any
12865                                                              following global
12866                                                              memory operations.
12867                                                            - Ensures that the
12868                                                              preceding
12869                                                              local/generic load
12870                                                              atomic/atomicrmw
12871                                                              with an equal or
12872                                                              wider sync scope
12873                                                              and memory ordering
12874                                                              stronger than
12875                                                              unordered (this is
12876                                                              termed the
12877                                                              acquire-fence-paired-atomic)
12878                                                              has completed
12879                                                              before following
12880                                                              global memory
12881                                                              operations. This
12882                                                              satisfies the
12883                                                              requirements of
12884                                                              acquire.
12885                                                            - Ensures that all
12886                                                              previous memory
12887                                                              operations have
12888                                                              completed before a
12889                                                              following
12890                                                              local/generic store
12891                                                              atomic/atomicrmw
12892                                                              with an equal or
12893                                                              wider sync scope
12894                                                              and memory ordering
12895                                                              stronger than
12896                                                              unordered (this is
12897                                                              termed the
12898                                                              release-fence-paired-atomic).
12899                                                              This satisfies the
12900                                                              requirements of
12901                                                              release.
12902                                                            - Must happen before
12903                                                              the following
12904                                                              buffer_gl0_inv.
12905                                                            - Ensures that the
12906                                                              acquire-fence-paired-atomic
12907                                                              has completed
12908                                                              before invalidating
12909                                                              the
12910                                                              cache. Therefore
12911                                                              any following
12912                                                              locations read must
12913                                                              be no older than
12914                                                              the value read by
12915                                                              the
12916                                                              acquire-fence-paired-atomic.
12917
12918                                                          2. buffer_gl0_inv
12919
12920                                                            - If CU wavefront execution
12921                                                              mode, omit.
12922                                                            - Ensures that
12923                                                              following
12924                                                              loads will not see
12925                                                              stale data.
12926
12927      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12928                                - system                     vmcnt(0) & vscnt(0)
12929
12930                                                            - If OpenCL and
12931                                                              address space is
12932                                                              not generic, omit
12933                                                              lgkmcnt(0).
12934                                                            - If OpenCL and
12935                                                              address space is
12936                                                              local, omit
12937                                                              vmcnt(0) and vscnt(0).
12938                                                            - However, since LLVM
12939                                                              currently has no
12940                                                              address space on
12941                                                              the fence, the full
12942                                                              s_waitcnt must always
12943                                                              conservatively be
12944                                                              generated (see comment
12945                                                              for previous fence).
12946                                                            - Could be split into
12947                                                              separate s_waitcnt
12948                                                              vmcnt(0), s_waitcnt
12949                                                              vscnt(0) and s_waitcnt
12950                                                              lgkmcnt(0) to allow
12951                                                              them to be
12952                                                              independently moved
12953                                                              according to the
12954                                                              following rules.
12955                                                            - s_waitcnt vmcnt(0)
12956                                                              must happen after
12957                                                              any preceding
12958                                                              global/generic
12959                                                              load/load
12960                                                              atomic/
12961                                                              atomicrmw-with-return-value.
12962                                                            - s_waitcnt vscnt(0)
12963                                                              must happen after
12964                                                              any preceding
12965                                                              global/generic
12966                                                              store/store atomic/
12967                                                              atomicrmw-no-return-value.
12968                                                            - s_waitcnt lgkmcnt(0)
12969                                                              must happen after
12970                                                              any preceding
12971                                                              local/generic
12972                                                              load/store/load
12973                                                              atomic/store
12974                                                              atomic/atomicrmw.
12975                                                            - Must happen before
12976                                                              the following
12977                                                              buffer_gl*_inv.
12978                                                            - Ensures that the
12979                                                              preceding
12980                                                              global/local/generic
12981                                                              load
12982                                                              atomic/atomicrmw
12983                                                              with an equal or
12984                                                              wider sync scope
12985                                                              and memory ordering
12986                                                              stronger than
12987                                                              unordered (this is
12988                                                              termed the
12989                                                              acquire-fence-paired-atomic)
12990                                                              has completed
12991                                                              before invalidating
12992                                                              the caches. This
12993                                                              satisfies the
12994                                                              requirements of
12995                                                              acquire.
12996                                                            - Ensures that all
12997                                                              previous memory
12998                                                              operations have
12999                                                              completed before a
13000                                                              following
13001                                                              global/local/generic
13002                                                              store
13003                                                              atomic/atomicrmw
13004                                                              with an equal or
13005                                                              wider sync scope
13006                                                              and memory ordering
13007                                                              stronger than
13008                                                              unordered (this is
13009                                                              termed the
13010                                                              release-fence-paired-atomic).
13011                                                              This satisfies the
13012                                                              requirements of
13013                                                              release.
13014
13015                                                          2. buffer_gl0_inv;
13016                                                             buffer_gl1_inv
13017
13018                                                            - Must happen before
13019                                                              any following
13020                                                              global/generic
13021                                                              load/load
13022                                                              atomic/store/store
13023                                                              atomic/atomicrmw.
13024                                                            - Ensures that
13025                                                              following loads
13026                                                              will not see stale
13027                                                              global data. This
13028                                                              satisfies the
13029                                                              requirements of
13030                                                              acquire.
13031
13032      **Sequential Consistent Atomic**
13033      ------------------------------------------------------------------------------------
13034      load atomic  seq_cst      - singlethread - global   *Same as corresponding
13035                                - wavefront    - local    load atomic acquire,
13036                                               - generic  except must generate
13037                                                          all instructions even
13038                                                          for OpenCL.*
13039      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
13040                                               - generic     vmcnt(0) & vscnt(0)
13041
13042                                                            - If CU wavefront execution
13043                                                              mode, omit vmcnt(0) and
13044                                                              vscnt(0).
13045                                                            - Could be split into
13046                                                              separate s_waitcnt
13047                                                              vmcnt(0), s_waitcnt
13048                                                              vscnt(0), and s_waitcnt
13049                                                              lgkmcnt(0) to allow
13050                                                              them to be
13051                                                              independently moved
13052                                                              according to the
13053                                                              following rules.
13054                                                            - s_waitcnt lgkmcnt(0) must
13055                                                              happen after
13056                                                              preceding
13057                                                              local/generic load
13058                                                              atomic/store
13059                                                              atomic/atomicrmw
13060                                                              with memory
13061                                                              ordering of seq_cst
13062                                                              and with equal or
13063                                                              wider sync scope.
13064                                                              (Note that seq_cst
13065                                                              fences have their
13066                                                              own s_waitcnt
13067                                                              lgkmcnt(0) and so do
13068                                                              not need to be
13069                                                              considered.)
13070                                                            - s_waitcnt vmcnt(0)
13071                                                              must happen after
13072                                                              preceding
13073                                                              global/generic load
13074                                                              atomic/
13075                                                              atomicrmw-with-return-value
13076                                                              with memory
13077                                                              ordering of seq_cst
13078                                                              and with equal or
13079                                                              wider sync scope.
13080                                                              (Note that seq_cst
13081                                                              fences have their
13082                                                              own s_waitcnt
13083                                                              vmcnt(0) and so do
13084                                                              not need to be
13085                                                              considered.)
13086                                                            - s_waitcnt vscnt(0)
13087                                                              must happen after
13088                                                              preceding
13089                                                              global/generic store
13090                                                              atomic/
13091                                                              atomicrmw-no-return-value
13092                                                              with memory
13093                                                              ordering of seq_cst
13094                                                              and with equal or
13095                                                              wider sync scope.
13096                                                              (Note that seq_cst
13097                                                              fences have their
13098                                                              own s_waitcnt
13099                                                              vscnt(0) and so do
13100                                                              not need to be
13101                                                              considered.)
13102                                                            - Ensures any
13103                                                              preceding
13104                                                              sequentially
13105                                                              consistent global/local
13106                                                              memory instructions
13107                                                              have completed
13108                                                              before executing
13109                                                              this sequentially
13110                                                              consistent
13111                                                              instruction. This
13112                                                              prevents reordering
13113                                                              a seq_cst store
13114                                                              followed by a
13115                                                              seq_cst load. (Note
13116                                                              that seq_cst is
13117                                                              stronger than
13118                                                              acquire/release as
13119                                                              the reordering of
13120                                                              load acquire
13121                                                              followed by a store
13122                                                              release is
13123                                                              prevented by the
13124                                                              s_waitcnt of
13125                                                              the release, but
13126                                                              there is nothing
13127                                                              preventing a store
13128                                                              release followed by
13129                                                              load acquire from
13130                                                              completing out of
13131                                                              order. The s_waitcnt
13132                                                              could be placed after
13133                                                              the seq_cst store or
13134                                                              before the seq_cst load. We
13135                                                              choose the load to
13136                                                              make the s_waitcnt be
13137                                                              as late as possible
13138                                                              so that the store
13139                                                              may have already
13140                                                              completed.)
13141
13142                                                          2. *Following
13143                                                             instructions same as
13144                                                             corresponding load
13145                                                             atomic acquire,
13146                                                             except must generate
13147                                                             all instructions even
13148                                                             for OpenCL.*
13149      load atomic  seq_cst      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
13152
13153                                                            - If CU wavefront execution
13154                                                              mode, omit.
13155                                                            - Could be split into
13156                                                              separate s_waitcnt
13157                                                              vmcnt(0) and s_waitcnt
13158                                                              vscnt(0) to allow
13159                                                              them to be
13160                                                              independently moved
13161                                                              according to the
13162                                                              following rules.
13163                                                            - s_waitcnt vmcnt(0)
13164                                                              must happen after
13165                                                              preceding
13166                                                              global/generic load
13167                                                              atomic/
13168                                                              atomicrmw-with-return-value
13169                                                              with memory
13170                                                              ordering of seq_cst
13171                                                              and with equal or
13172                                                              wider sync scope.
13173                                                              (Note that seq_cst
13174                                                              fences have their
13175                                                              own s_waitcnt
13176                                                              vmcnt(0) and so do
13177                                                              not need to be
13178                                                              considered.)
13179                                                            - s_waitcnt vscnt(0)
13180                                                              must happen after
13181                                                              preceding
13182                                                              global/generic store
13183                                                              atomic/
13184                                                              atomicrmw-no-return-value
13185                                                              with memory
13186                                                              ordering of seq_cst
13187                                                              and with equal or
13188                                                              wider sync scope.
13189                                                              (Note that seq_cst
13190                                                              fences have their
13191                                                              own s_waitcnt
13192                                                              vscnt(0) and so do
13193                                                              not need to be
13194                                                              considered.)
13195                                                            - Ensures any
13196                                                              preceding
13197                                                              sequentially
13198                                                              consistent global
13199                                                              memory instructions
13200                                                              have completed
13201                                                              before executing
13202                                                              this sequentially
13203                                                              consistent
13204                                                              instruction. This
13205                                                              prevents reordering
13206                                                              a seq_cst store
13207                                                              followed by a
13208                                                              seq_cst load. (Note
13209                                                              that seq_cst is
13210                                                              stronger than
13211                                                              acquire/release as
13212                                                              the reordering of
13213                                                              load acquire
13214                                                              followed by a store
13215                                                              release is
13216                                                              prevented by the
13217                                                              s_waitcnt of
13218                                                              the release, but
13219                                                              there is nothing
13220                                                              preventing a store
13221                                                              release followed by
13222                                                              load acquire from
13223                                                              completing out of
13224                                                              order. The s_waitcnt
13225                                                              could be placed after
13226                                                              seq_store or before
13227                                                              the seq_load. We
13228                                                              choose the load to
13229                                                              make the s_waitcnt be
13230                                                              as late as possible
13231                                                              so that the store
13232                                                              may have already
13233                                                              completed.)
13234
13235                                                          2. *Following
13236                                                             instructions same as
13237                                                             corresponding load
13238                                                             atomic acquire,
13239                                                             except must generate
13240                                                             all instructions even
13241                                                             for OpenCL.*
13242
13243      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13244                                - system       - generic     vmcnt(0) & vscnt(0)
13245
13246                                                            - Could be split into
13247                                                              separate s_waitcnt
13248                                                              vmcnt(0), s_waitcnt
13249                                                              vscnt(0) and s_waitcnt
13250                                                              lgkmcnt(0) to allow
13251                                                              them to be
13252                                                              independently moved
13253                                                              according to the
13254                                                              following rules.
13255                                                            - s_waitcnt lgkmcnt(0)
13256                                                              must happen after
13257                                                              preceding
13258                                                              local load
13259                                                              atomic/store
13260                                                              atomic/atomicrmw
13261                                                              with memory
13262                                                              ordering of seq_cst
13263                                                              and with equal or
13264                                                              wider sync scope.
13265                                                              (Note that seq_cst
13266                                                              fences have their
13267                                                              own s_waitcnt
13268                                                              lgkmcnt(0) and so do
13269                                                              not need to be
13270                                                              considered.)
13271                                                            - s_waitcnt vmcnt(0)
13272                                                              must happen after
13273                                                              preceding
13274                                                              global/generic load
13275                                                              atomic/
13276                                                              atomicrmw-with-return-value
13277                                                              with memory
13278                                                              ordering of seq_cst
13279                                                              and with equal or
13280                                                              wider sync scope.
13281                                                              (Note that seq_cst
13282                                                              fences have their
13283                                                              own s_waitcnt
13284                                                              vmcnt(0) and so do
13285                                                              not need to be
13286                                                              considered.)
13287                                                            - s_waitcnt vscnt(0)
13288                                                              must happen after
13289                                                              preceding
13290                                                              global/generic store
13291                                                              atomic/
13292                                                              atomicrmw-no-return-value
13293                                                              with memory
13294                                                              ordering of seq_cst
13295                                                              and with equal or
13296                                                              wider sync scope.
13297                                                              (Note that seq_cst
13298                                                              fences have their
13299                                                              own s_waitcnt
13300                                                              vscnt(0) and so do
13301                                                              not need to be
13302                                                              considered.)
13303                                                            - Ensures any
13304                                                              preceding
13305                                                              sequentially
13306                                                              consistent global
13307                                                              memory instructions
13308                                                              have completed
13309                                                              before executing
13310                                                              this sequentially
13311                                                              consistent
13312                                                              instruction. This
13313                                                              prevents reordering
13314                                                              a seq_cst store
13315                                                              followed by a
13316                                                              seq_cst load. (Note
13317                                                              that seq_cst is
13318                                                              stronger than
13319                                                              acquire/release as
13320                                                              the reordering of
13321                                                              load acquire
13322                                                              followed by a store
13323                                                              release is
13324                                                              prevented by the
13325                                                              s_waitcnt of
13326                                                              the release, but
13327                                                              there is nothing
13328                                                              preventing a store
13329                                                              release followed by
13330                                                              load acquire from
13331                                                              completing out of
13332                                                              order. The s_waitcnt
13333                                                              could be placed after
13334                                                              seq_store or before
13335                                                              the seq_load. We
13336                                                              choose the load to
13337                                                              make the s_waitcnt be
13338                                                              as late as possible
13339                                                              so that the store
13340                                                              may have already
13341                                                              completed.)
13342
13343                                                          2. *Following
13344                                                             instructions same as
13345                                                             corresponding load
13346                                                             atomic acquire,
13347                                                             except must generate
13348                                                             all instructions even
13349                                                             for OpenCL.*
13350      store atomic seq_cst      - singlethread - global   *Same as corresponding
13351                                - wavefront    - local    store atomic release,
13352                                - workgroup    - generic  except must generate
13353                                - agent                   all instructions even
13354                                - system                  for OpenCL.*
13355      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
13356                                - wavefront    - local    atomicrmw acq_rel,
13357                                - workgroup    - generic  except must generate
13358                                - agent                   all instructions even
13359                                - system                  for OpenCL.*
13360      fence        seq_cst      - singlethread *none*     *Same as corresponding
13361                                - wavefront               fence acq_rel,
13362                                - workgroup               except must generate
13363                                - agent                   all instructions even
13364                                - system                  for OpenCL.*
13365      ============ ============ ============== ========== ================================
13366
13367 .. _amdgpu-amdhsa-trap-handler-abi:
13368
13369 Trap Handler ABI
13370 ~~~~~~~~~~~~~~~~
13371
13372 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13373 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13374 supports the ``s_trap`` instruction. For usage see:
13375
13376 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13377 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13378 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13379
13380   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13381      :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13382
13383      =================== =============== =============== =======================================
13384      Usage               Code Sequence   Trap Handler    Description
13385                                          Inputs
13386      =================== =============== =============== =======================================
13387      reserved            ``s_trap 0x00``                 Reserved by hardware.
13388      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
13389                                            ``queue_ptr`` intrinsic (not implemented).
13390                                          ``VGPR0``:
13391                                            ``arg``
13392      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13393                                            ``queue_ptr`` the trap instruction. The associated
13394                                                          queue is signalled to put it into the
13395                                                          error state.  When the queue is put in
13396                                                          the error state, the waves executing
13397                                                          dispatches on the queue will be
13398                                                          terminated.
13399      ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
13400                                                            as a no-operation. The trap handler
13401                                                            is entered and immediately returns to
13402                                                            continue execution of the wavefront.
13403                                                          - If the debugger is enabled, causes
13404                                                            the debug trap to be reported by the
13405                                                            debugger and the wavefront is put in
13406                                                            the halt state with the PC at the
13407                                                            instruction.  The debugger must
13408                                                            increment the PC and resume the wave.
13409      reserved            ``s_trap 0x04``                 Reserved.
13410      reserved            ``s_trap 0x05``                 Reserved.
13411      reserved            ``s_trap 0x06``                 Reserved.
13412      reserved            ``s_trap 0x07``                 Reserved.
13413      reserved            ``s_trap 0x08``                 Reserved.
13414      reserved            ``s_trap 0xfe``                 Reserved.
13415      reserved            ``s_trap 0xff``                 Reserved.
13416      =================== =============== =============== =======================================
13417
13418 ..
13419
13420   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13421      :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13422
13423      =================== =============== =============== =======================================
13424      Usage               Code Sequence   Trap Handler    Description
13425                                          Inputs
13426      =================== =============== =============== =======================================
13427      reserved            ``s_trap 0x00``                 Reserved by hardware.
13428      debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
13429                                                          breakpoints. Causes wave to be halted
13430                                                          with the PC at the trap instruction.
13431                                                          The debugger is responsible for
13432                                                          resuming the wave, including the
13433                                                          instruction the breakpoint overwrote.
13434      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13435                                            ``queue_ptr`` the trap instruction. The associated
13436                                                          queue is signalled to put it into the
13437                                                          error state.  When the queue is put in
13438                                                          the error state, the waves executing
13439                                                          dispatches on the queue will be
13440                                                          terminated.
13441      ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If debugger not enabled then behaves
13442                                                            as a no-operation. The trap handler
13443                                                            is entered and immediately returns to
13444                                                            continue execution of the wavefront.
13445                                                          - If the debugger is enabled, causes
13446                                                            the debug trap to be reported by the
13447                                                            debugger and the wavefront is put in
13448                                                            the halt state with the PC at the
13449                                                            instruction.  The debugger must
13450                                                            increment the PC and resume the wave.
13451      reserved            ``s_trap 0x04``                 Reserved.
13452      reserved            ``s_trap 0x05``                 Reserved.
13453      reserved            ``s_trap 0x06``                 Reserved.
13454      reserved            ``s_trap 0x07``                 Reserved.
13455      reserved            ``s_trap 0x08``                 Reserved.
13456      reserved            ``s_trap 0xfe``                 Reserved.
13457      reserved            ``s_trap 0xff``                 Reserved.
13458      =================== =============== =============== =======================================
13459
13460 ..
13461
13462   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13463      :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13464
13465      =================== =============== ================ ================= =======================================
13466      Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13467      =================== =============== ================ ================= =======================================
13468      reserved            ``s_trap 0x00``                                    Reserved by hardware.
13469      debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
13470                                                                             breakpoints. Causes wave to be halted
13471                                                                             with the PC at the trap instruction.
13472                                                                             The debugger is responsible for
13473                                                                             resuming the wave, including the
13474                                                                             instruction the breakpoint overwrote.
13475      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
13476                                            ``queue_ptr``                    the trap instruction. The associated
13477                                                                             queue is signalled to put it into the
13478                                                                             error state.  When the queue is put in
13479                                                                             the error state, the waves executing
13480                                                                             dispatches on the queue will be
13481                                                                             terminated.
13482      ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
13483                                                                               as a no-operation. The trap handler
13484                                                                               is entered and immediately returns to
13485                                                                               continue execution of the wavefront.
13486                                                                             - If the debugger is enabled, causes
13487                                                                               the debug trap to be reported by the
13488                                                                               debugger and the wavefront is put in
13489                                                                               the halt state with the PC at the
13490                                                                               instruction.  The debugger must
13491                                                                               increment the PC and resume the wave.
13492      reserved            ``s_trap 0x04``                                    Reserved.
13493      reserved            ``s_trap 0x05``                                    Reserved.
13494      reserved            ``s_trap 0x06``                                    Reserved.
13495      reserved            ``s_trap 0x07``                                    Reserved.
13496      reserved            ``s_trap 0x08``                                    Reserved.
13497      reserved            ``s_trap 0xfe``                                    Reserved.
13498      reserved            ``s_trap 0xff``                                    Reserved.
13499      =================== =============== ================ ================= =======================================
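
As a rough illustration of the code object V3 and V4-onwards tables above, the
trap IDs could be captured in a small C++ lookup. This is only a sketch: the
``TrapUsage`` enumeration and the ``classifyTrap`` helper are hypothetical names
introduced here and are not part of LLVM or of any HSA runtime. Only the numeric
values and their usage come from the tables. IDs that the tables do not list are
reported as ``NotListed`` rather than guessed.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical classification of s_trap IDs (code object V3 and above).
  enum class TrapUsage {
    Reserved,           // 0x00, 0x04-0x08, 0xfe, 0xff
    DebuggerBreakpoint, // 0x01
    LLVMTrap,           // 0x02, llvm.trap
    LLVMDebugTrap,      // 0x03, llvm.debugtrap
    NotListed           // any ID the tables do not mention
  };

  // Maps an s_trap immediate to its documented usage.
  constexpr TrapUsage classifyTrap(std::uint8_t Id) {
    return Id == 0x01 ? TrapUsage::DebuggerBreakpoint
         : Id == 0x02 ? TrapUsage::LLVMTrap
         : Id == 0x03 ? TrapUsage::LLVMDebugTrap
         : (Id == 0x00 || (Id >= 0x04 && Id <= 0x08) || Id >= 0xfe)
               ? TrapUsage::Reserved
               : TrapUsage::NotListed;
  }

  static_assert(classifyTrap(0x02) == TrapUsage::LLVMTrap, "llvm.trap");

The code object V2 table assigns ``s_trap 0x01`` differently, so this sketch
does not apply to it.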
13500
13501 .. _amdgpu-amdhsa-function-call-convention:
13502
13503 Call Convention
13504 ~~~~~~~~~~~~~~~
13505
13506 .. note::
13507
13508   This section is currently incomplete and has inaccuracies. It is a work in
13509   progress that will be updated as information is determined.
13510
13511 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13512 addresses. Unswizzled addresses are normal linear addresses.
13513
13514 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
13515
13516 Kernel Functions
13517 ++++++++++++++++
13518
13519 This section describes the call convention ABI for the outer kernel function.
13520
13521 See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
13522 convention.
13523
13524 The following is not part of the AMDGPU kernel calling convention but describes
13525 how the AMDGPU backend implements function calls:
13526
13527 1.  Clang decides the kernarg layout to match the *HSA Programmer's Language
13528     Reference* [HSA]_.
13529
13530     - All structs are passed directly.
13531     - Lambda values are passed *TBA*.
13532
13533     .. TODO::
13534
13535       - Does this really follow HSA rules? Or are structs >16 bytes passed
13536         by-value struct?
13537       - What is ABI for lambda values?
13538
13539 2.  The kernel performs certain setup in its prolog, as described in
13540     :ref:`amdgpu-amdhsa-kernel-prolog`.
13541
13542 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13543
13544 Non-Kernel Functions
13545 ++++++++++++++++++++
13546
13547 This section describes the call convention ABI for functions other than the
13548 outer kernel function.
13549
13550 If a kernel has function calls, then scratch is always allocated and used for
13551 the call stack, which grows from low address to high address using the swizzled
13552 scratch address space.
13553
13554 On entry to a function:
13555
13556 1.  SGPR0-3 contain a V# with the following properties (see
13557     :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
13558
13559     * Base address pointing to the beginning of the wavefront scratch backing
13560       memory.
13561     * Swizzled with dword element size and stride of wavefront size elements.
13562
13563 2.  The FLAT_SCRATCH register pair is set up. See
13564     :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
13565 3.  GFX6-GFX8: M0 register set to the size of LDS in bytes. See
13566     :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
13567 4.  The EXEC register is set to the lanes active on entry to the function.
13568 5.  MODE register: *TBD*
13569 6.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
13570     below.
13571 7.  SGPR30-31 contain the return address (RA), the code address that the
13572     function must return to when it completes. The value is undefined if the
13573     function is *no return*.
13574 8.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13575     offset relative to the beginning of the wavefront scratch backing memory.
13576
13577     The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
13578     offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
13579     manner.
13580
13581     The unswizzled SP value can be converted into the swizzled SP value by:
13582
13583       | swizzled SP = unswizzled SP / wavefront size
13584
13585     This may be used to obtain the private address space address of stack
13586     objects and to convert this address to a flat address by adding the flat
13587     scratch aperture base address.
13588
13589     The swizzled SP value is always 4 byte aligned for the ``r600``
13590     architecture and 16 byte aligned for the ``amdgcn`` architecture.
13591
13592     .. note::
13593
13594       The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13595       OpenCL language which has the largest base type defined as 16 bytes.
13596
13597     On entry, the swizzled SP value is the address of the first function
13598     argument passed on the stack. Other stack passed arguments are positive
13599     offsets from the entry swizzled SP value.
13600
13601     The function may use positive offsets beyond the last stack passed argument
13602     for stack allocated local variables and register spill slots. If necessary,
13603     the function may align these to greater alignment than 16 bytes. After these
13604     the function may dynamically allocate space for such things as runtime sized
13605     ``alloca`` local allocations.
13606
13607     If the function calls another function, it will place any stack allocated
13608     arguments after the last local allocation and adjust SGPR32 to the address
13609     after the last local allocation.
13610
13611 9.  All other registers are unspecified.
13612 10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
13613     to the function.
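
The stack pointer arithmetic in item 8 can be sketched in C++ as follows. The
helper names and the example values (a 64-lane wavefront, an unswizzled SP of
``0x4000``, and an aperture base of ``0x100000000``) are assumptions made for
illustration. Only the division by the wavefront size, the addition of the flat
scratch aperture base address, and the 16 byte ``amdgcn`` alignment come from
the text.

.. code-block:: c++

  #include <cstdint>

  // swizzled SP = unswizzled SP / wavefront size
  constexpr std::uint32_t swizzledSP(std::uint32_t UnswizzledSP,
                                     std::uint32_t WavefrontSize) {
    return UnswizzledSP / WavefrontSize;
  }

  // Flat address of a stack object: its private address space address plus
  // the flat scratch aperture base address.
  constexpr std::uint64_t flatStackAddress(std::uint64_t ApertureBase,
                                           std::uint32_t PrivateAddress) {
    return ApertureBase + PrivateAddress;
  }

  // For amdgcn the swizzled SP is always 16 byte aligned.
  constexpr bool isAmdgcnAlignedSP(std::uint32_t SwizzledSP) {
    return SwizzledSP % 16 == 0;
  }

  // Made-up example values, checked at compile time.
  static_assert(swizzledSP(0x4000, 64) == 0x100, "0x4000 / 64");
  static_assert(isAmdgcnAlignedSP(swizzledSP(0x4000, 64)), "16 byte aligned");
  static_assert(flatStackAddress(0x100000000, 0x100) == 0x100000100,
                "aperture base + private address");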
13614
13615 On exit from a function:
13616
13617 1.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13618     described below. Any registers used are considered clobbered registers.
13619 2.  The following registers are preserved and have the same value as on entry:
13620
13621     * FLAT_SCRATCH
13622     * EXEC
13623     * GFX6-GFX8: M0
13624     * All SGPR registers except the clobbered registers of SGPR4-31.
13625     * VGPR40-47
13626     * VGPR56-63
13627     * VGPR72-79
13628     * VGPR88-95
13629     * VGPR104-111
13630     * VGPR120-127
13631     * VGPR136-143
13632     * VGPR152-159
13633     * VGPR168-175
13634     * VGPR184-191
13635     * VGPR200-207
13636     * VGPR216-223
13637     * VGPR232-239
13638     * VGPR248-255
13639
13640         .. note::
13641
13642           Except the argument registers, the VGPRs clobbered and the preserved
13643           registers are intermixed at regular intervals in order to keep a
13644           similar ratio independent of the number of allocated VGPRs.
13645
13646     * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13647     * Lanes of all VGPRs that are inactive at the call site.
13648
13649       For the AMDGPU backend, an inter-procedural register allocation (IPRA)
13650       optimization may mark some of the clobbered SGPR and VGPR registers as
13651       preserved if it can be determined that the called function does not change
13652       their value.
13653
13654 3.  The PC is set to the RA provided on entry.
13655 4.  MODE register: *TBD*.
13656 5.  All other registers are clobbered.
13657 6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
13658     the function is available to the caller.
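
The preserved VGPR ranges listed above follow a regular pattern: starting at
VGPR40, every second group of eight registers is preserved. A minimal C++
sketch of that pattern follows. ``isPreservedVGPR`` is a hypothetical helper
name, and the predicate reflects only the static list. It does not model the
IPRA refinement or the rule for lanes that are inactive at the call site.

.. code-block:: c++

  // Returns true if VGPR<N> falls in one of the listed preserved ranges
  // (VGPR40-47, VGPR56-63, ..., VGPR248-255).
  constexpr bool isPreservedVGPR(unsigned N) {
    return N >= 40 && N <= 255 && (N % 16) >= 8;
  }

  static_assert(isPreservedVGPR(40) && isPreservedVGPR(47), "VGPR40-47");
  static_assert(!isPreservedVGPR(48) && !isPreservedVGPR(55), "VGPR48-55");
  static_assert(isPreservedVGPR(248) && isPreservedVGPR(255), "VGPR248-255");
  static_assert(!isPreservedVGPR(31), "argument registers are clobbered");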
13659
13660 .. TODO::
13661
13662   - How are function results returned? The address of structured types is passed
13663     by reference, but what about other types?
13664
13665 The function input arguments are made up of the formal arguments explicitly
13666 declared by the source language function plus the implicit input arguments used
13667 by the implementation.
13668
13669 The source language input arguments are:
13670
13671 1. Any source language implicit ``this`` or ``self`` argument comes first as a
13672    pointer type.
13673 2. Followed by the function formal arguments in left to right source order.
13674
13675 The source language result arguments are:
13676
13677 1. The function result argument.
13678
13679 Source language input or result struct type arguments that are less than or
13680 equal to 16 bytes are decomposed recursively into their base type fields, and
13681 each field is passed as if a separate argument. For input arguments, if the
13682 called function requires the struct to be in memory, for example because its
13683 address is taken, then the function body is responsible for allocating a stack
13684 location and copying the field arguments into it. Clang terms this *direct
13685 struct*.
13686
Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location
to hold a copy of the struct value and for passing its address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.
13692
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
13699
13700 *TODO: correct the ``sret`` definition.*
13701
13702 .. TODO::
13703
13704   Is this definition correct? Or is ``sret`` only used if passing in registers, and
13705   pass as non-decomposed struct as stack argument? Or something else? Is the
13706   memory location in the caller stack frame, or a stack memory argument and so
13707   no address is passed as the caller can directly write to the argument stack
13708   location? But then the stack location is still live after return. If an
13709   argument stack location is it the first stack argument or the last one?
13710
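The three size-based cases above can be summarized in a short Python sketch. It
is illustrative only: the function name ``classify_struct`` and the returned
labels are hypothetical, and, as the TODO above notes, the ``sret`` rule itself
may need correction.

.. code-block:: python

  # Size threshold stated in the text above.
  DIRECT_STRUCT_MAX_BYTES = 16


  def classify_struct(size_in_bytes, is_result=False):
      """Return a label for how a struct of this size is passed or returned.

      The labels are hypothetical names for the three cases in the text:
      "direct"   - decomposed into fields, each passed as a separate argument
      "by-value" - input struct copied by the caller and passed by reference
      "sret"     - result struct written through a caller-provided address
      """
      if size_in_bytes <= DIRECT_STRUCT_MAX_BYTES:
          return "direct"
      return "sret" if is_result else "by-value"

For example, ``classify_struct(16)`` gives ``"direct"`` while
``classify_struct(17)`` gives ``"by-value"``.
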
13711 Lambda argument types are treated as struct types with an implementation defined
13712 set of fields.
13713
13714 .. TODO::
13715
13716   Need to specify the ABI for lambda types for AMDGPU.
13717
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
13721
13722 The AMDGPU backend walks the function call graph from the leaves to determine
13723 which implicit input arguments are used, propagating to each caller of the
13724 function. The used implicit arguments are appended to the function arguments
13725 after the source language arguments in the following order:
13726
13727 .. TODO::
13728
13729   Is recursion or external functions supported?
13730
13731 1.  Work-Item ID (1 VGPR)
13732
    The X, Y and Z work-item IDs are packed into a single VGPR with the following
    layout. Only fields actually used by the function are set. The other bits
    are undefined.
13736
13737     The values come from the initial kernel execution state. See
13738     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13739
13740     .. table:: Work-item implicit argument layout
13741       :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
13742
13743       ======= ======= ==============
13744       Bits    Size    Field Name
13745       ======= ======= ==============
13746       9:0     10 bits X Work-Item ID
13747       19:10   10 bits Y Work-Item ID
13748       29:20   10 bits Z Work-Item ID
13749       31:30   2 bits  Unused
13750       ======= ======= ==============
13751
13752 2.  Dispatch Ptr (2 SGPRs)
13753
13754     The value comes from the initial kernel execution state. See
13755     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13756
13757 3.  Queue Ptr (2 SGPRs)
13758
13759     The value comes from the initial kernel execution state. See
13760     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13761
13762 4.  Kernarg Segment Ptr (2 SGPRs)
13763
13764     The value comes from the initial kernel execution state. See
13765     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13766
5.  Dispatch ID (2 SGPRs)
13768
13769     The value comes from the initial kernel execution state. See
13770     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13771
13772 6.  Work-Group ID X (1 SGPR)
13773
13774     The value comes from the initial kernel execution state. See
13775     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13776
13777 7.  Work-Group ID Y (1 SGPR)
13778
13779     The value comes from the initial kernel execution state. See
13780     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13781
13782 8.  Work-Group ID Z (1 SGPR)
13783
13784     The value comes from the initial kernel execution state. See
13785     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13786
13787 9.  Implicit Argument Ptr (2 SGPRs)
13788
13789     The value is computed by adding an offset to Kernarg Segment Ptr to get the
13790     global address space pointer to the first kernarg implicit argument.
13791
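As an illustration of the packed Work-Item ID layout described in item 1 above,
the following Python sketch packs and unpacks the three 10-bit fields. The
function names are hypothetical and the code only mirrors the bit positions
given in the layout table. It does not model which fields a function actually
uses.

.. code-block:: python

  from typing import Tuple

  FIELD_MASK = 0x3FF  # each work-item ID field is 10 bits wide


  def pack_work_item_id(x: int, y: int, z: int) -> int:
      """Pack X into bits 9:0, Y into bits 19:10 and Z into bits 29:20."""
      return (x & FIELD_MASK) | ((y & FIELD_MASK) << 10) | ((z & FIELD_MASK) << 20)


  def unpack_work_item_id(packed: int) -> Tuple[int, int, int]:
      """Extract (X, Y, Z) from the packed value, ignoring bits 31:30."""
      x = packed & FIELD_MASK
      y = (packed >> 10) & FIELD_MASK
      z = (packed >> 20) & FIELD_MASK
      return x, y, z

Because bits 31:30 are unused, ``unpack_work_item_id`` masks each field rather
than relying on the upper bits being zero.
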
13792 The input and result arguments are assigned in order in the following manner:
13793
13794 .. note::
13795
13796   There are likely some errors and omissions in the following description that
13797   need correction.
13798
13799   .. TODO::
13800
13801     Check the Clang source code to decipher how function arguments and return
13802     results are handled. Also see the AMDGPU specific values used.
13803
13804 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
13805   VGPR31.
13806
13807   If there are more arguments than will fit in these registers, the remaining
13808   arguments are allocated on the stack in order on naturally aligned
13809   addresses.
13810
13811   .. TODO::
13812
13813     How are overly aligned structures allocated on the stack?
13814
13815 * SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
13816   SGPR29.
13817
13818   If there are more arguments than will fit in these registers, the remaining
13819   arguments are allocated on the stack in order on naturally aligned
13820   addresses.
13821
13822 Note that decomposed struct type arguments may have some fields passed in
13823 registers and some in memory.
13824
13825 .. TODO::
13826
13827   So, a struct which can pass some fields as decomposed register arguments, will
13828   pass the rest as decomposed stack elements? But an argument that will not start
13829   in registers will not be decomposed and will be passed as a non-decomposed
13830   stack value?
13831
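The VGPR assignment rule above can be sketched as follows. This is a simplified
model under stated assumptions, not the backend's implementation: every
argument is assumed to occupy exactly one 32-bit VGPR or one 4-byte stack slot,
stack offsets are measured from a hypothetical start of the stack argument
area, and the function name is invented for illustration.

.. code-block:: python

  MAX_ARG_VGPRS = 32   # VGPR0 through VGPR31, as stated above
  STACK_SLOT_BYTES = 4  # assumption: each overflow argument is one dword


  def assign_vgpr_arguments(arg_count):
      """Return a list of ("vgpr", n) or ("stack", byte_offset) locations."""
      locations = []
      stack_offset = 0
      for index in range(arg_count):
          if index < MAX_ARG_VGPRS:
              # Consecutive registers starting at VGPR0.
              locations.append(("vgpr", index))
          else:
              # Remaining arguments go on the stack in order.
              locations.append(("stack", stack_offset))
              stack_offset += STACK_SLOT_BYTES
      return locations

With 34 single-dword arguments, the first 32 land in VGPR0 through VGPR31 and
the last two land at stack offsets 0 and 4 in this model.
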
13832 The following is not part of the AMDGPU function calling convention but
13833 describes how the AMDGPU implements function calls:
13834
1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
    unswizzled scratch address. It is only needed if runtime-sized ``alloca``
    instructions are used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.
13841
3.  Allocating SGPR arguments on the stack is not supported.
13843
13844 4.  No CFI is currently generated. See
13845     :ref:`amdgpu-dwarf-call-frame-information`.
13846
13847     .. note::
13848
13849       CFI will be generated that defines the CFA as the unswizzled address
13850       relative to the wave scratch base in the unswizzled private address space
13851       of the lowest address stack allocated local variable.
13852
13853       ``DW_AT_frame_base`` will be defined as the swizzled address in the
13854       swizzled private address space by dividing the CFA by the wavefront size
13855       (since CFA is always at least dword aligned which matches the scratch
13856       swizzle element size).
13857
13858       If no dynamic stack alignment was performed, the stack allocated arguments
13859       are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
13860       local variables and register spill slots are accessed as positive offsets
13861       relative to ``DW_AT_frame_base``.
13862
13863 5.  Function argument passing is implemented by copying the input physical
13864     registers to virtual registers on entry. The register allocator can spill if
13865     necessary. These are copied back to physical registers at call sites. The
13866     net effect is that each function call can have these values in entirely
13867     distinct locations. The IPRA can help avoid shuffling argument registers.
13868 6.  Call sites are implemented by setting up the arguments at positive offsets
13869     from SP. Then SP is incremented to account for the known frame size before
13870     the call and decremented after the call.
13871
13872     .. note::
13873
13874       The CFI will reflect the changed calculation needed to compute the CFA
13875       from SP.
13876
7.  4-byte spill slots are used in the stack frame. One slot is reserved as an
    emergency spill slot. Buffer instructions are used for stack accesses and
    not the ``flat_scratch`` instruction.
13880
13881     .. TODO::
13882
13883       Explain when the emergency spill slot is used.
13884
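The relationship between the CFA and ``DW_AT_frame_base`` described in the note
under item 4 above can be written as a one-line calculation. The sketch below
is only an arithmetic illustration of that note. The function name and the
sample values are hypothetical, and no CFI is currently generated.

.. code-block:: python

  def frame_base_from_cfa(cfa_unswizzled: int, wavefront_size: int) -> int:
      """Convert an unswizzled CFA to a swizzled per-lane frame base.

      Follows the note above: the swizzled address is the unswizzled address
      divided by the wavefront size.
      """
      return cfa_unswizzled // wavefront_size

For instance, with a wavefront size of 64, an unswizzled CFA of ``0x4000``
corresponds to a swizzled frame base of ``0x100`` in this model.
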
13885 .. TODO::
13886
13887   Possible broken issues:
13888
13889   - Stack arguments must be aligned to required alignment.
13890   - Stack is aligned to max(16, max formal argument alignment)
13891   - Direct argument < 64 bits should check register budget.
13892   - Register budget calculation should respect ``inreg`` for SGPR.
13893   - SGPR overflow is not handled.
13894   - struct with 1 member unpeeling is not checking size of member.
13895   - ``sret`` is after ``this`` pointer.
13896   - Caller is not implementing stack realignment: need an extra pointer.
13897   - Should say AMDGPU passes FP rather than SP.
13898   - Should CFI define CFA as address of locals or arguments. Difference is
13899     apparent when have implemented dynamic alignment.
13900   - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
13901     highest address of stack frame and use negative offset for locals. Would
13902     allow SP to be the same as FP and could support signal-handler-like as now
13903     have a real SP for the top of the stack.
13904   - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
13905     arguments?
13906
13907 AMDPAL
13908 ------
13909
13910 This section provides code conventions used when the target triple OS is
13911 ``amdpal`` (see :ref:`amdgpu-target-triples`).
13912
13913 .. _amdgpu-amdpal-code-object-metadata-section:
13914
13915 Code Object Metadata
13916 ~~~~~~~~~~~~~~~~~~~~
13917
13918 .. note::
13919
13920   The metadata is currently in development and is subject to major
13921   changes. Only the current version is supported. *When this document
13922   was generated the version was 2.6.*
13923
13924 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
13925 record (see :ref:`amdgpu-note-records-v3-onwards`).
13926
13927 The metadata is represented as Message Pack formatted binary data (see
13928 [MsgPack]_). The top level is a Message Pack map that includes the keys
13929 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
13930 and referenced tables.
13931
13932 Additional information can be added to the maps. To avoid conflicts, any
13933 key names should be prefixed by "*vendor-name*." where ``vendor-name``
13934 can be the name of the vendor and specific vendor tool that generates the
13935 information. The prefix is abbreviated to simply "." when it appears
13936 within a map that has been added by the same *vendor-name*.
13937
13938   .. table:: AMDPAL Code Object Metadata Map
13939      :name: amdgpu-amdpal-code-object-metadata-map-table
13940
13941      =================== ============== ========= ======================================================================
13942      String Key          Value Type     Required? Description
13943      =================== ============== ========= ======================================================================
13944      "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
13945                          2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
13946      "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
13947                          map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
13948                                                   definition of the keys included in that map.
13949      =================== ============== ========= ======================================================================
13950
13951 ..
13952
13953   .. table:: AMDPAL Code Object Pipeline Metadata Map
13954      :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
13955
13956      ====================================== ============== ========= ===================================================
13957      String Key                             Value Type     Required? Description
13958      ====================================== ============== ========= ===================================================
13959      ".name"                                string                   Source name of the pipeline.
13960      ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
13961
13962                                                                        - "VsPs"
13963                                                                        - "Gs"
13964                                                                        - "Cs"
13965                                                                        - "Ngg"
13966                                                                        - "Tess"
13967                                                                        - "GsTess"
13968                                                                        - "NggTess"
13969
13970      ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
13971                                             2 integers               64 bits is the "stable" portion of the hash, used
13972                                                                      for e.g. shader replacement lookup. Upper 64 bits
13973                                                                      is the "unique" portion of the hash, used for
13974                                                                      e.g. pipeline cache lookup. The value is
13975                                                                      implementation defined, and can not be relied on
13976                                                                      between different builds of the compiler.
13977      ".shaders"                             map                      Per-API shader metadata. See
13978                                                                      :ref:`amdgpu-amdpal-code-object-shader-map-table`
13979                                                                      for the definition of the keys included in that
13980                                                                      map.
13981      ".hardware_stages"                     map                      Per-hardware stage metadata. See
13982                                                                      :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
13983                                                                      for the definition of the keys included in that
13984                                                                      map.
13985      ".shader_functions"                    map                      Per-shader function metadata. See
13986                                                                      :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
13987                                                                      for the definition of the keys included in that
13988                                                                      map.
13989      ".registers"                           map            Required  Hardware register configuration. See
13990                                                                      :ref:`amdgpu-amdpal-code-object-register-map-table`
13991                                                                      for the definition of the keys included in that
13992                                                                      map.
13993      ".user_data_limit"                     integer                  Number of user data entries accessed by this
13994                                                                      pipeline.
13995      ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
13996                                                                      NoUserDataSpilling.
13997      ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
13998                                                                      viewport array index feature. Pipelines which use
13999                                                                      this feature can render into all 16 viewports,
14000                                                                      whereas pipelines which do not use it are
14001                                                                      restricted to viewport #0.
14002      ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
14003                                                                      handling data-passing between the ES and GS
14004                                                                      shader stages. This can be zero if the data is
14005                                                                      passed using off-chip buffers. This value should
14006                                                                      be used to program all user-SGPRs which have been
14007                                                                      marked with "UserDataMapping::EsGsLdsSize"
14008                                                                      (typically only the GS and VS HW stages will ever
14009                                                                      have a user-SGPR so marked).
14010      ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
14011                                                                      (maximum number of threads in a subgroup).
14012      ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
14013      ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
14014      ".api"                                 string                   Name of the client graphics API.
14015      ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
14016                                                                      be defined by the driver using the compiler if
14017                                                                      they want to be able to correlate API-specific
14018                                                                      information used during creation at a later time.
14019      ====================================== ============== ========= ===================================================
14020
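To make the nesting of the two maps above concrete, the following Python sketch
models the decoded metadata as ordinary dictionaries and lists and checks for
the keys marked *Required*. It does not perform any Message Pack decoding. The
helper name, the pipeline name and the hash values are hypothetical
placeholders, and the empty ``.registers`` map is used only to keep the example
short.

.. code-block:: python

  # Keys marked "Required" in the two tables above.
  REQUIRED_TOP_LEVEL_KEYS = ("amdpal.version", "amdpal.pipelines")
  REQUIRED_PIPELINE_KEYS = (".internal_pipeline_hash", ".registers")


  def missing_required_keys(metadata):
      """Return a list describing every required key that is absent."""
      missing = [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in metadata]
      for index, pipeline in enumerate(metadata.get("amdpal.pipelines", [])):
          for key in REQUIRED_PIPELINE_KEYS:
              if key not in pipeline:
                  missing.append(f"amdpal.pipelines[{index}]{key}")
      return missing


  # A made-up example of decoded metadata for a single compute pipeline.
  example_metadata = {
      "amdpal.version": [2, 6],  # (major, minor)
      "amdpal.pipelines": [
          {
              ".name": "example_pipeline",                  # placeholder
              ".type": "Cs",
              ".internal_pipeline_hash": [0x1234, 0x5678],  # placeholder
              ".registers": {},                             # placeholder
          }
      ],
  }

Calling ``missing_required_keys(example_metadata)`` returns an empty list,
while an empty top-level map reports both top-level keys as missing.
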
14021 ..
14022
14023   .. table:: AMDPAL Code Object Shader Map
14024      :name: amdgpu-amdpal-code-object-shader-map-table
14025
14026
14027      +-------------+--------------+-------------------------------------------------------------------+
14028      |String Key   |Value Type    |Description                                                        |
14029      +=============+==============+===================================================================+
14030      |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
14031      |- ".vertex"  |              |for the definition of the keys included in that map.               |
14032      |- ".hull"    |              |                                                                   |
14033      |- ".domain"  |              |                                                                   |
14034      |- ".geometry"|              |                                                                   |
14035      |- ".pixel"   |              |                                                                   |
14036      +-------------+--------------+-------------------------------------------------------------------+
14037
14038 ..
14039
14040   .. table:: AMDPAL Code Object API Shader Metadata Map
14041      :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
14042
14043      ==================== ============== ========= =====================================================================
14044      String Key           Value Type     Required? Description
14045      ==================== ============== ========= =====================================================================
14046      ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
14047                           2 integers               is implementation defined, and can not be relied on between
14048                                                    different builds of the compiler.
14049      ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
14050                           string                   include:
14051
14052                                                      - ".ls"
14053                                                      - ".hs"
14054                                                      - ".es"
14055                                                      - ".gs"
14056                                                      - ".vs"
14057                                                      - ".ps"
14058                                                      - ".cs"
14059
14060      ==================== ============== ========= =====================================================================
14061
14062 ..
14063
14064   .. table:: AMDPAL Code Object Hardware Stage Map
14065      :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14066
14067      +-------------+--------------+-----------------------------------------------------------------------+
14068      |String Key   |Value Type    |Description                                                            |
14069      +=============+==============+=======================================================================+
14070      |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14071      |- ".hs"      |              |for the definition of the keys included in that map.                   |
14072      |- ".es"      |              |                                                                       |
14073      |- ".gs"      |              |                                                                       |
14074      |- ".vs"      |              |                                                                       |
14075      |- ".ps"      |              |                                                                       |
14076      |- ".cs"      |              |                                                                       |
14077      +-------------+--------------+-----------------------------------------------------------------------+
14078
14079 ..
14080
14081   .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14082      :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14083
14084      ========================== ============== ========= ===============================================================
14085      String Key                 Value Type     Required? Description
14086      ========================== ============== ========= ===============================================================
14087      ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
14088      ".scratch_memory_size"     integer                  Scratch memory size in bytes.
14089      ".lds_size"                integer                  Local Data Share size in bytes.
14090      ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
14091      ".vgpr_count"              integer                  Number of VGPRs used.
14092      ".agpr_count"              integer                  Number of AGPRs used.
14093      ".sgpr_count"              integer                  Number of SGPRs used.
14094      ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
14095                                                          directive to instruct the compiler to limit the VGPR usage to
14096                                                          be less than or equal to the specified value (only set if
14097                                                          different from HW default).
14098      ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
14099                                                          default).
14100      ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
14101                                 3 integers
14102      ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
14103      ".uses_uavs"               boolean                  The shader reads or writes UAVs.
14104      ".uses_rovs"               boolean                  The shader reads or writes ROVs.
14105      ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
14106      ".writes_depth"            boolean                  The shader writes out a depth value.
14107      ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
14108                                                          memory or GDS.
14109      ".uses_prim_id"            boolean                  The shader uses PrimID.
14110      ========================== ============== ========= ===============================================================
14111
14112 ..
14113
14114   .. table:: AMDPAL Code Object Shader Function Map
14115      :name: amdgpu-amdpal-code-object-shader-function-map-table
14116
14117      =============== ============== ====================================================================
14118      String Key      Value Type     Description
14119      =============== ============== ====================================================================
14120      *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
14121                                     entry address. The value is the function's metadata. See
14122                                     :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14123      =============== ============== ====================================================================
14124
14125 ..
14126
14127   .. table:: AMDPAL Code Object Shader Function Metadata Map
14128      :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14129
14130      ============================= ============== =================================================================
14131      String Key                    Value Type     Description
14132      ============================= ============== =================================================================
14133      ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
14134                                    2 integers     is implementation defined, and can not be relied on between
14135                                                   different builds of the compiler.
14136      ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
14137      ".lds_size"                   integer        Size in bytes of LDS memory.
14138      ".vgpr_count"                 integer        Number of VGPRs used by the shader.
14139      ".sgpr_count"                 integer        Number of SGPRs used by the shader.
14140      ".stack_frame_size_in_bytes"  integer        Amount of stack size used by the shader.
14141      ".shader_subtype"             string         Shader subtype/kind. Values include:
14142
14143                                                     - "Unknown"
14144
14145      ============================= ============== =================================================================
14146
14147 ..
14148
14149   .. table:: AMDPAL Code Object Register Map
14150      :name: amdgpu-amdpal-code-object-register-map-table
14151
14152      ========================== ============== ====================================================================
14153      32-bit Integer Key         Value Type     Description
14154      ========================== ============== ====================================================================
14155      ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14156                                                a GRBM register (i.e., driver accessible GPU register number, not
14157                                                shader GPR register number). The driver is required to program each
14158                                                specified register to the corresponding specified value when
14159                                                executing this pipeline. Typically, the ``reg offsets`` are the
14160                                                ``uint16_t`` offsets to each register as defined by the hardware
14161                                                chip headers. The register is set to the provided value. However, a
14162                                                ``reg offset`` that specifies a user data register (e.g.,
14163                                                COMPUTE_USER_DATA_0) needs special treatment. See
14164                                                :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14165                                                information.
14166      ========================== ============== ====================================================================
14167
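The register map description above distinguishes ordinary registers, whose
values are written as given, from user data registers, which need special
treatment. A consumer might separate the two groups as in the following Python
sketch. The function name is hypothetical, the register offsets in the usage
note are made-up placeholders, and the predicate that recognizes a user data
register is supplied by the caller because the actual offsets come from the
hardware chip headers.

.. code-block:: python

  def split_register_map(registers, is_user_data_register):
      """Split a {reg offset: value} map into (literal, user_data) maps.

      ``is_user_data_register`` is a caller-supplied predicate taking a
      register offset; this sketch does not know any real offsets.
      """
      literal = {}
      user_data = {}
      for reg_offset, value in registers.items():
          if is_user_data_register(reg_offset):
              # Value is an encoding that needs special treatment.
              user_data[reg_offset] = value
          else:
              # Value is written to the register as provided.
              literal[reg_offset] = value
      return literal, user_data

For example, with the placeholder map ``{0x10: 7, 0x20: 3}`` and a predicate
that treats only offset ``0x20`` as a user data register, the function returns
``({0x10: 7}, {0x20: 3})``.
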
14168 .. _amdgpu-amdpal-code-object-user-data-section:
14169
14170 User Data
14171 +++++++++
14172
14173 Each hardware stage has a set of 32-bit physical SPI *user data registers*
14174 (either 16 or 32 based on graphics IP and the stage) which can be
14175 written from a command buffer and then loaded into SGPRs when waves are
14176 launched via a subsequent dispatch or draw operation. This is the way
14177 most arguments are passed from the application/runtime to a hardware
14178 shader.
14179
14180 PAL abstracts this functionality by exposing a set of 128 *user data
14181 entries* per pipeline that a client can use to pass arguments from a command
14182 buffer to one or more shaders in that pipeline. The ELF code object must
14183 specify a mapping from virtualized *user data entries* to physical *user
14184 data registers*, and PAL is responsible for implementing that mapping,
14185 including spilling overflow *user data entries* to memory if needed.
14186
14187 Since the *user data registers* are GRBM-accessible SPI registers, this
14188 mapping is actually embedded in the ``.registers`` metadata entry. For
14189 most registers, the value in that map is a literal 32-bit value that
14190 should be written to the register by the driver. However, when the
14191 register is a *user data register* (any USER_DATA register e.g.,
14192 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14193 the driver to write either a *user data entry* value or one of several
14194 driver-internal values to the register. This encoding is described in
14195 the following table:
14196
14197 .. note::
14198
14199   Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14200   and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14201   always be programmed to the address of the GlobalTable, and *user data
14202   register* 1 must always be programmed to the address of the PerShaderTable.
14203
14204 ..
14205
14206   .. table:: AMDPAL User Data Mapping
14207      :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14208
14209      ==========  =================  ===============================================================================
14210      Value       Name               Description
14211      ==========  =================  ===============================================================================
14212      0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14213      0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
14214                                     always point to *user data register* 0).
14215      0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
14216                                     :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14217                                     for more detail (should always point to *user data register* 1).
14218      0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
14219                                     :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14220                                     more detail.
14221      0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14222                                     reference the draw index in the vertex shader. Only supported by the first
14223                                     stage in a graphics pipeline.
14224      0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
14225                                     a graphics pipeline.
14226      0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
14227                                     graphics pipeline.
14228      0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14229                                     a buffer containing the grid dimensions for a Compute dispatch operation. The
14230                                     high half of the address is stored in the next sequential user-SGPR. Only
14231                                     supported by compute pipelines.
14232      0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
14233                                     space used for the ES/GS pseudo-ring-buffer for passing data between shader
14234                                     stages.
14235      0x1000000B  ViewId             View id (32-bit unsigned integer). Identifies a view of graphics
14236                                     pipeline instancing.
14237      0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
14238                                     can only appear for one shader stage per pipeline.
14239      0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
14240      0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
14241                                     only appear for one shader stage per pipeline.
14242      0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
14243                                     only appear for one shader stage per pipeline (PS). These replace color targets
14244                                     and are completely separate from any UAVs used by the shader. This is optional,
14245                                     and only used by the PS when UAV exports are used to replace color-target
14246                                     exports to optimize specific shaders.
14247      0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
14248                                     some NGG pipelines to perform culling.  This value contains the address of the
14249                                     first of two consecutive registers which provide the full GPU address.
14250      0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
14251      ==========  =================  ===============================================================================
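
As an illustration only, the following Python sketch shows how a tool that
inspects the ``.registers`` map might interpret the value mapped to a *user
data register*. The helper name ``classify_user_data_value`` and its return
tuples are hypothetical and are not part of PAL or LLVM; the constants are
copied from the table above.

.. code-block:: python

  # Special values from the "AMDPAL User Data Mapping" table.
  INTERNAL_USER_DATA = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }

  # PAL exposes 128 user data entries per pipeline (values 0..127).
  MAX_USER_DATA_ENTRIES = 128


  def classify_user_data_value(value):
      """Hypothetical helper: interpret a user data register mapping value."""
      if 0 <= value < MAX_USER_DATA_ENTRIES:
          return ("user_data_entry", value)
      name = INTERNAL_USER_DATA.get(value)
      if name is not None:
          return ("internal", name)
      raise ValueError(f"unrecognized user data mapping value: {value:#x}")

For example, ``classify_user_data_value(5)`` yields ``("user_data_entry", 5)``,
while ``classify_user_data_value(0x10000002)`` yields
``("internal", "SpillTable")``.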
14252
14253 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14254
14255 Per-Shader Table
14256 ################
14257
14258 Low 32 bits of the GPU address for an optional buffer in the ``.data``
14259 section of the ELF. The high 32 bits of the address match the high 32 bits
14260 of the shader's program counter.
14261
14262 The buffer can be anything the shader compiler needs it for, and
14263 allows each shader to have its own region of the ``.data`` section.
14264 Typically, this could be a table of buffer SRDs and the data pointed to
14265 by the buffer SRDs, but it could be a flat-address region of memory as
14266 well. Its layout and usage are defined by the shader compiler.
14267
14268 Each shader's table in the ``.data`` section is referenced by the symbol
14269 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``  where *xs* corresponds with the
14270 hardware shader stage the data is for. E.g.,
14271 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
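
The naming rule and the address reconstruction described above can be
sketched in Python as follows. Both function names are hypothetical, and the
sketch assumes the program counter and the register value are available as
plain integers.

.. code-block:: python

  def per_shader_table_symbol(stage):
      """Build the ``.data`` symbol name for a hardware stage such as "cs"."""
      return f"_amdgpu_{stage}_shdr_intrl_data"


  def per_shader_table_address(program_counter, table_low32):
      """Combine the high 32 bits of the PC with the low 32 bits of the table."""
      high = program_counter & 0xFFFFFFFF00000000
      return high | (table_low32 & 0xFFFFFFFF)

For instance, ``per_shader_table_symbol("cs")`` gives
``_amdgpu_cs_shdr_intrl_data``, the compute shader symbol named above.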
14272
14273 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14274
14275 Spill Table
14276 ###########
14277
14278 It is possible for a hardware shader to need access to more *user data
14279 entries* than there are slots available in user data registers for one
14280 or more hardware shader stages. In that case, the PAL runtime expects
14281 the necessary *user data entries* to be spilled to GPU memory, with
14282 one user data register used to point to the spilled user data memory. The
14283 value of the *user data entry* must then represent the location where
14284 a shader expects to read the low 32 bits of the table's GPU virtual
14285 address. The *spill table* itself represents a set of 32-bit values
14286 managed by the PAL runtime in GPU-accessible memory that can be made
14287 indirectly accessible to a hardware shader.
14288
14289 Unspecified OS
14290 --------------
14291
14292 This section provides code conventions used when the target triple OS is
14293 empty (see :ref:`amdgpu-target-triples`).
14294
14295 Trap Handler ABI
14296 ~~~~~~~~~~~~~~~~
14297
14298 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime
14299 does not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
14300 intrinsics are handled as follows:
14301
14302   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14303      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14304
14305      =============== =============== ===========================================
14306      Usage           Code Sequence   Description
14307      =============== =============== ===========================================
14308      llvm.trap       s_endpgm        Causes wavefront to be terminated.
14309      llvm.debugtrap  *none*          Compiler warning given that there is no
14310                                      trap handler installed.
14311      =============== =============== ===========================================
14312
14313 Source Languages
14314 ================
14315
14316 .. _amdgpu-opencl:
14317
14318 OpenCL
14319 ------
14320
14321 When the language is OpenCL, the following differences occur:
14322
14323 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14324 2. The AMDGPU backend appends additional arguments to the kernel's explicit
14325    arguments for the AMDHSA OS (see
14326    :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
14327 3. Additional metadata is generated
14328    (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14329
14330   .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14331      :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14332
14333      ======== ==== ========= ===========================================
14334      Position Byte Byte      Description
14335               Size Alignment
14336      ======== ==== ========= ===========================================
14337      1        8    8         OpenCL Global Offset X
14338      2        8    8         OpenCL Global Offset Y
14339      3        8    8         OpenCL Global Offset Z
14340      4        8    8         OpenCL address of printf buffer
14341      5        8    8         OpenCL address of virtual queue used by
14342                              enqueue_kernel.
14343      6        8    8         OpenCL address of AqlWrap struct used by
14344                              enqueue_kernel.
14345      7        8    8         Pointer argument used for Multi-grid
14346                              synchronization.
14347      ======== ==== ========= ===========================================
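
The layout implied by the table can be sketched in Python. This is a minimal
sketch under one assumption that the text above does not state: the implicit
arguments are taken to start at the next 8-byte aligned offset after the
explicit arguments. The argument names and the helper
``implicit_arg_offsets`` are hypothetical labels chosen for this example.

.. code-block:: python

  # One entry per table row, in position order; each is 8 bytes, 8-byte aligned.
  IMPLICIT_ARGS = [
      "global_offset_x",
      "global_offset_y",
      "global_offset_z",
      "printf_buffer",
      "virtual_queue",
      "aqlwrap",
      "multigrid_sync_arg",
  ]
  IMPLICIT_ARG_SIZE = 8
  IMPLICIT_ARG_ALIGN = 8


  def align_up(value, alignment):
      """Round value up to the next multiple of alignment."""
      return (value + alignment - 1) // alignment * alignment


  def implicit_arg_offsets(explicit_args_size):
      """Return a byte offset per implicit argument (assumed placement)."""
      offset = align_up(explicit_args_size, IMPLICIT_ARG_ALIGN)
      offsets = {}
      for name in IMPLICIT_ARGS:
          offsets[name] = offset
          offset += IMPLICIT_ARG_SIZE
      return offsets

Under that assumption, a kernel with 12 bytes of explicit arguments would have
its first implicit argument at byte offset 16.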
14348
14349 .. _amdgpu-hcc:
14350
14351 HCC
14352 ---
14353
14354 When the language is HCC, the following differences occur:
14355
14356 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14357
14358 .. _amdgpu-assembler:
14359
14360 Assembler
14361 ---------
14362
14363 The AMDGPU backend has an LLVM-MC based assembler, which is currently in
14364 development. It supports AMDGCN GFX6-GFX11.
14365
14366 This section describes general syntax for instructions and operands.
14367
14368 Instructions
14369 ~~~~~~~~~~~~
14370
14371 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14372
14373   | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14374     <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14375
14376 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14377 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14378
14379 The order of operands and modifiers is fixed.
14380 Most modifiers are optional and may be omitted.
14381
14382 Links to detailed instruction syntax descriptions may be found in the following
14383 table. Note that features under development are not included
14384 in this description.
14385
14386     ============= ============================================= =======================================
14387     Architecture  Core ISA                                      ISA Variants and Extensions
14388     ============= ============================================= =======================================
14389     GCN 2         :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`             \-
14390     GCN 3, GCN 4  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`             \-
14391     GCN 5         :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14392
14393                                                                 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14394
14395                                                                 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14396
14397                                                                 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14398
14399                                                                 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14400
14401                                                                 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14402
14403     CDNA 1        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14404
14405     CDNA 2        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14406
14407     CDNA 3        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14408
14409     RDNA 1        :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>`     :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14410
14411                                                                 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14412
14413                                                                 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14414
14415                                                                 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14416
14417     RDNA 2        :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>`   :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14418
14419                                                                 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14420
14421                                                                 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14422
14423                                                                 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14424
14425                                                                 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14426
14427                                                                 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14428
14429                                                                 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14430
14431     RDNA 3        :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>`           :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
14432
14433                                                                 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
14434
14435                                                                 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
14436
14437                                                                 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
14438     ============= ============================================= =======================================
14439
14440 For more information about instructions, their semantics and supported
14441 combinations of operands, refer to one of the instruction set architecture
14442 manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14443 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14444 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_,
14445 [AMD-GCN-GFX10-RDNA2]_ and [AMD-GCN-GFX11-RDNA3]_.
14446
14447 Operands
14448 ~~~~~~~~
14449
14450 A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14451
14452 Modifiers
14453 ~~~~~~~~~
14454
14455 A detailed description of modifiers may be found
14456 :doc:`here<AMDGPUModifierSyntax>`.
14457
14458 Instruction Examples
14459 ~~~~~~~~~~~~~~~~~~~~
14460
14461 DS
14462 ++
14463
14464 .. code-block:: nasm
14465
14466   ds_add_u32 v2, v4 offset:16
14467   ds_write_src2_b64 v2 offset0:4 offset1:8
14468   ds_cmpst_f32 v2, v4, v6
14469   ds_min_rtn_f64 v[8:9], v2, v[4:5]
14470
14471 For the full list of supported instructions, refer to "LDS/GDS instructions" in
14472 the ISA Manual.
14473
14474 FLAT
14475 ++++
14476
14477 .. code-block:: nasm
14478
14479   flat_load_dword v1, v[3:4]
14480   flat_store_dwordx3 v[3:4], v[5:7]
14481   flat_atomic_swap v1, v[3:4], v5 glc
14482   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14483   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14484
14485 For the full list of supported instructions, refer to "FLAT instructions" in the
14486 ISA Manual.
14487
14488 MUBUF
14489 +++++
14490
14491 .. code-block:: nasm
14492
14493   buffer_load_dword v1, off, s[4:7], s1
14494   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14495   buffer_store_format_xy v[1:2], off, s[4:7], s1
14496   buffer_wbinvl1
14497   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14498
14499 For the full list of supported instructions, refer to "MUBUF Instructions" in
14500 the ISA Manual.
14501
14502 SMRD/SMEM
14503 +++++++++
14504
14505 .. code-block:: nasm
14506
14507   s_load_dword s1, s[2:3], 0xfc
14508   s_load_dwordx8 s[8:15], s[2:3], s4
14509   s_load_dwordx16 s[88:103], s[2:3], s4
14510   s_dcache_inv_vol
14511   s_memtime s[4:5]
14512
14513 For the full list of supported instructions, refer to "Scalar Memory Operations"
14514 in the ISA Manual.
14515
14516 SOP1
14517 ++++
14518
14519 .. code-block:: nasm
14520
14521   s_mov_b32 s1, s2
14522   s_mov_b64 s[0:1], 0x80000000
14523   s_cmov_b32 s1, 200
14524   s_wqm_b64 s[2:3], s[4:5]
14525   s_bcnt0_i32_b64 s1, s[2:3]
14526   s_swappc_b64 s[2:3], s[4:5]
14527   s_cbranch_join s[4:5]
14528
14529 For the full list of supported instructions, refer to "SOP1 Instructions" in the
14530 ISA Manual.
14531
14532 SOP2
14533 ++++
14534
14535 .. code-block:: nasm
14536
14537   s_add_u32 s1, s2, s3
14538   s_and_b64 s[2:3], s[4:5], s[6:7]
14539   s_cselect_b32 s1, s2, s3
14540   s_andn2_b32 s2, s4, s6
14541   s_lshr_b64 s[2:3], s[4:5], s6
14542   s_ashr_i32 s2, s4, s6
14543   s_bfm_b64 s[2:3], s4, s6
14544   s_bfe_i64 s[2:3], s[4:5], s6
14545   s_cbranch_g_fork s[4:5], s[6:7]
14546
14547 For the full list of supported instructions, refer to "SOP2 Instructions" in the
14548 ISA Manual.
14549
14550 SOPC
14551 ++++
14552
14553 .. code-block:: nasm
14554
14555   s_cmp_eq_i32 s1, s2
14556   s_bitcmp1_b32 s1, s2
14557   s_bitcmp0_b64 s[2:3], s4
14558   s_setvskip s3, s5
14559
14560 For the full list of supported instructions, refer to "SOPC Instructions" in the
14561 ISA Manual.
14562
14563 SOPP
14564 ++++
14565
14566 .. code-block:: nasm
14567
14568   s_barrier
14569   s_nop 2
14570   s_endpgm
14571   s_waitcnt 0 ; Wait for all counters to be 0
14572   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
14573   s_waitcnt vmcnt(1) ; Wait until the vmcnt counter is at most 1.
14574   s_sethalt 9
14575   s_sleep 10
14576   s_sendmsg 0x1
14577   s_sendmsg sendmsg(MSG_INTERRUPT)
14578   s_trap 1
14579
14580 For the full list of supported instructions, refer to "SOPP Instructions" in the
14581 ISA Manual.
14582
14583 Unless otherwise mentioned, little verification is performed on the operands
14584 of SOPP Instructions, so it is up to the programmer to be familiar with the
14585 range of acceptable values.
14586
14587 VALU
14588 ++++
14589
14590 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
14591 the assembler automatically uses the optimal encoding based on the operands. To
14592 force a specific encoding, one can add a suffix to the opcode of the instruction:
14593
14594 * _e32 for 32-bit VOP1/VOP2/VOPC
14595 * _e64 for 64-bit VOP3
14596 * _dpp for VOP_DPP
14597 * _sdwa for VOP_SDWA
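
The suffix convention above can be modeled with a short Python sketch. It is
illustrative only: ``forced_encoding`` is a hypothetical helper, not the
assembler's own logic, and it only inspects the trailing suffix of a mnemonic.

.. code-block:: python

  # Suffixes and the encodings they force, as listed above.
  FORCED_ENCODINGS = {
      "_e32": "32-bit VOP1/VOP2/VOPC",
      "_e64": "64-bit VOP3",
      "_dpp": "VOP_DPP",
      "_sdwa": "VOP_SDWA",
  }


  def forced_encoding(mnemonic):
      """Split a mnemonic into (base opcode, forced encoding or None)."""
      for suffix, encoding in FORCED_ENCODINGS.items():
          if mnemonic.endswith(suffix):
              return mnemonic[: -len(suffix)], encoding
      # No suffix: the assembler is free to choose the encoding.
      return mnemonic, None

With this helper, ``forced_encoding("v_mov_b32_e32")`` returns
``("v_mov_b32", "32-bit VOP1/VOP2/VOPC")``, and ``forced_encoding("v_nop")``
returns ``("v_nop", None)``.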
14598
14599 VOP1/VOP2/VOP3/VOPC examples:
14600
14601 .. code-block:: nasm
14602
14603   v_mov_b32 v1, v2
14604   v_mov_b32_e32 v1, v2
14605   v_nop
14606   v_cvt_f64_i32_e32 v[1:2], v2
14607   v_floor_f32_e32 v1, v2
14608   v_bfrev_b32_e32 v1, v2
14609   v_add_f32_e32 v1, v2, v3
14610   v_mul_i32_i24_e64 v1, v2, 3
14611   v_mul_i32_i24_e32 v1, -3, v3
14612   v_mul_i32_i24_e32 v1, -100, v3
14613   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
14614   v_max_f16_e32 v1, v2, v3
14615
14616 VOP_DPP examples:
14617
14618 .. code-block:: nasm
14619
14620   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
14621   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14622   v_mov_b32 v0, v0 wave_shl:1
14623   v_mov_b32 v0, v0 row_mirror
14624   v_mov_b32 v0, v0 row_bcast:31
14625   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
14626   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14627   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14628
14629 VOP_SDWA examples:
14630
14631 .. code-block:: nasm
14632
14633   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14634   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14635   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14636   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14637   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14638
14639 For the full list of supported instructions, refer to "Vector ALU instructions".
14640
14641 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14642
14643 Code Object V2 Predefined Symbols
14644 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14645
14646 .. warning::
14647   Code object V2 is not the default code object version emitted by
14648   this version of LLVM.
14649
14650 The AMDGPU assembler defines and updates some symbols automatically. These
14651 symbols do not affect code generation.
14652
14653 .option.machine_version_major
14654 +++++++++++++++++++++++++++++
14655
14656 Set to the GFX major generation number of the target being assembled for. For
14657 example, when assembling for a "GFX9" target this will be set to the integer
14658 value "9". The possible GFX major generation numbers are presented in
14659 :ref:`amdgpu-processors`.
14660
14661 .option.machine_version_minor
14662 +++++++++++++++++++++++++++++
14663
14664 Set to the GFX minor generation number of the target being assembled for. For
14665 example, when assembling for a "GFX810" target this will be set to the integer
14666 value "1". The possible GFX minor generation numbers are presented in
14667 :ref:`amdgpu-processors`.
14668
14669 .option.machine_version_stepping
14670 ++++++++++++++++++++++++++++++++
14671
14672 Set to the GFX stepping generation number of the target being assembled for.
14673 For example, when assembling for a "GFX704" target this will be set to the
14674 integer value "4". The possible GFX stepping generation numbers are presented
14675 in :ref:`amdgpu-processors`.
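
The relationship between a processor name and the three symbols can be
sketched in Python. This is a rough sketch, assuming the last two characters of
the name encode the minor and stepping numbers, and that the stepping
character is read as a hexadecimal digit (so that names such as ``gfx90a``
parse). The helper ``split_gfx_version`` is hypothetical; the authoritative
values are those listed in :ref:`amdgpu-processors`.

.. code-block:: python

  def split_gfx_version(processor):
      """Split a full processor name into (major, minor, stepping)."""
      name = processor.lower()
      if not name.startswith("gfx") or len(name) < 6:
          raise ValueError(f"expected a full processor name, got {processor!r}")
      digits = name[3:]
      major = int(digits[:-2])          # everything before the last two characters
      minor = int(digits[-2])           # second-to-last character
      stepping = int(digits[-1], 16)    # assumption: hexadecimal digit
      return major, minor, stepping

For the examples used above, ``split_gfx_version("gfx810")`` gives a minor
number of 1 and ``split_gfx_version("gfx704")`` gives a stepping number of 4.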
14676
14677 .kernel.vgpr_count
14678 ++++++++++++++++++
14679
14680 Set to zero each time a
14681 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14682 encountered. At each instruction, if the current value of this symbol is less
14683 than or equal to the maximum VGPR number explicitly referenced within that
14684 instruction then the symbol value is updated to equal that VGPR number plus
14685 one.
14686
14687 .kernel.sgpr_count
14688 ++++++++++++++++++
14689
14690 Set to zero each time a
14691 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14692 encountered. At each instruction, if the current value of this symbol is less
14693 than or equal to the maximum SGPR number explicitly referenced within that
14694 instruction then the symbol value is updated to equal that SGPR number plus
14695 one.
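
The update rule shared by ``.kernel.vgpr_count`` and ``.kernel.sgpr_count``
can be modeled in Python. This is a simplified sketch, not the assembler's
implementation: the class ``KernelRegCounts`` is hypothetical, and the regular
expression only recognizes plain ``vN``, ``sN``, ``v[M:N]`` and ``s[M:N]``
operands, ignoring special registers.

.. code-block:: python

  import re

  # Matches v3, s0, v[1:2], s[4:7]; mnemonics such as v_mov_b32 do not match.
  REG_RE = re.compile(r"\b([vs])(?:(\d+)|\[(\d+):(\d+)\])")


  class KernelRegCounts:
      def __init__(self):
          self.reset()

      def reset(self):
          """Models the reset done at each .amdgpu_hsa_kernel directive."""
          self.vgpr_count = 0
          self.sgpr_count = 0

      def observe(self, instruction):
          """Raise each count to the highest referenced register number plus one."""
          for match in REG_RE.finditer(instruction):
              kind, single, _low, high = match.groups()
              number = int(single) if single is not None else int(high)
              if kind == "v":
                  self.vgpr_count = max(self.vgpr_count, number + 1)
              else:
                  self.sgpr_count = max(self.sgpr_count, number + 1)

Observing ``flat_store_dword v[1:2], v0`` from a fresh state leaves
``vgpr_count`` at 3, because the highest VGPR referenced is ``v2``.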
14696
14697 .. _amdgpu-amdhsa-assembler-directives-v2:
14698
14699 Code Object V2 Directives
14700 ~~~~~~~~~~~~~~~~~~~~~~~~~
14701
14702 .. warning::
14703   Code object V2 is not the default code object version emitted by
14704   this version of LLVM.
14705
14706 The AMDGPU ABI defines auxiliary data in the output code object. In assembly
14707 source, one can specify it with assembler directives.
14708
14709 .hsa_code_object_version major, minor
14710 +++++++++++++++++++++++++++++++++++++
14711
14712 *major* and *minor* are integers that specify the version of the HSA code
14713 object that will be generated by the assembler.
14714
14715 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
14716 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14717
14718
14719 *major*, *minor*, and *stepping* are all integers that describe the instruction
14720 set architecture (ISA) version of the assembly program.
14721
14722 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
14723 "AMD" and *arch* should always be equal to "AMDGPU".
14724
14725 By default, the assembler will derive the ISA version, *vendor*, and *arch*
14726 from the value of the -mcpu option that is passed to the assembler.
14727
14728 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
14729
14730 .amdgpu_hsa_kernel (name)
14731 +++++++++++++++++++++++++
14732
14733 This directive specifies that the symbol with the given name is a kernel entry
14734 point (label) and that the object should contain a corresponding symbol of type
14735 STT_AMDGPU_HSA_KERNEL.
14736
14737 .amd_kernel_code_t
14738 ++++++++++++++++++
14739
14740 This directive marks the beginning of a list of key / value pairs that are used
14741 to specify the amd_kernel_code_t object that will be emitted by the assembler.
14742 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
14743 amd_kernel_code_t values that are unspecified a default value will be used. The
14744 default value for all keys is 0, with the following exceptions:
14745
14746 - *amd_code_version_major* defaults to 1.
14747 - *amd_kernel_code_version_minor* defaults to 2.
14748 - *amd_machine_kind* defaults to 1.
14749 - *amd_machine_version_major*, *amd_machine_version_minor*, and
14750   *amd_machine_version_stepping* are derived from the value of the -mcpu option
14751   that is passed to the assembler.
14752 - *kernel_code_entry_byte_offset* defaults to 256.
14753 - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
14754   defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
14755   Note that wavefront size is specified as a power of two, so a value of **n**
14756   means a size of 2^ **n**.
14757 - *call_convention* defaults to -1.
14758 - *kernarg_segment_alignment*, *group_segment_alignment*, and
14759   *private_segment_alignment* default to 4. Note that alignments are specified
14760   as a power of 2, so a value of **n** means an alignment of 2^ **n**.
14761 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
14762   GFX90A onwards.
14763 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
14764   GFX10 onwards.
14765 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
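
Because several of these defaults are stored as powers of two, a short Python
sketch may help when reading them. The function names are hypothetical and the
sketch covers only the *wavefront_size* and alignment rules stated in the list
above.

.. code-block:: python

  def default_wavefront_size_log2(gfx_major, wavefrontsize64=False):
      """Default wavefront_size key value (a log2) for a GFX major generation."""
      if gfx_major < 10:
          return 6
      return 6 if wavefrontsize64 else 5


  def from_log2(n):
      """Convert a power-of-two encoded key value n to the actual size 2^n."""
      return 1 << n


  # Default value of the three *_segment_alignment keys.
  DEFAULT_SEGMENT_ALIGNMENT_LOG2 = 4

For example, the default for a GFX9 target is a wavefront size of
``from_log2(6)``, that is 64, and the default segment alignment of 4 means a
16-byte alignment.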
14766
14767 The *.amd_kernel_code_t* directive must be placed immediately after the
14768 function label and before any instructions.
14769
14770 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the
14771 comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and test/CodeGen/AMDGPU/hsa.s.
14772
14773 .. _amdgpu-amdhsa-assembler-example-v2:
14774
14775 Code Object V2 Example Source Code
14776 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14777
14778 .. warning::
14779   Code Object V2 is not the default code object version emitted by
14780   this version of LLVM.
14781
14782 Here is an example of a minimal assembly source file, defining one HSA kernel:
14783
14784 .. code::
14785    :number-lines:
14786
14787    .hsa_code_object_version 1,0
14788    .hsa_code_object_isa
14789
14790    .hsatext
14791    .globl  hello_world
14792    .p2align 8
14793    .amdgpu_hsa_kernel hello_world
14794
14795    hello_world:
14796
14797       .amd_kernel_code_t
14798          enable_sgpr_kernarg_segment_ptr = 1
14799          is_ptr64 = 1
14800          compute_pgm_rsrc1_vgprs = 0
14801          compute_pgm_rsrc1_sgprs = 0
14802          compute_pgm_rsrc2_user_sgpr = 2
14803          compute_pgm_rsrc1_wgp_mode = 0
14804          compute_pgm_rsrc1_mem_ordered = 0
14805          compute_pgm_rsrc1_fwd_progress = 1
14806      .end_amd_kernel_code_t
14807
14808      s_load_dwordx2 s[0:1], s[0:1], 0x0
14809      v_mov_b32 v0, 3.14159
14810      s_waitcnt lgkmcnt(0)
14811      v_mov_b32 v1, s0
14812      v_mov_b32 v2, s1
14813      flat_store_dword v[1:2], v0
14814      s_endpgm
14815    .Lfunc_end0:
14816         .size   hello_world, .Lfunc_end0-hello_world
14817
14818 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
14819
14820 Code Object V3 and Above Predefined Symbols
14821 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14822
14823 The AMDGPU assembler defines and updates some symbols automatically. These
14824 symbols do not affect code generation.
14825
14826 .amdgcn.gfx_generation_number
14827 +++++++++++++++++++++++++++++
14828
14829 Set to the GFX major generation number of the target being assembled for. For
14830 example, when assembling for a "GFX9" target this will be set to the integer
14831 value "9". The possible GFX major generation numbers are presented in
14832 :ref:`amdgpu-processors`.
14833
14834 .amdgcn.gfx_generation_minor
14835 ++++++++++++++++++++++++++++
14836
14837 Set to the GFX minor generation number of the target being assembled for. For
14838 example, when assembling for a "GFX810" target this will be set to the integer
14839 value "1". The possible GFX minor generation numbers are presented in
14840 :ref:`amdgpu-processors`.
14841
14842 .amdgcn.gfx_generation_stepping
14843 +++++++++++++++++++++++++++++++
14844
14845 Set to the GFX stepping generation number of the target being assembled for.
14846 For example, when assembling for a "GFX704" target this will be set to the
14847 integer value "4". The possible GFX stepping generation numbers are presented
14848 in :ref:`amdgpu-processors`.
14849
14850 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
14851
14852 .amdgcn.next_free_vgpr
14853 ++++++++++++++++++++++
14854
14855 Set to zero before assembly begins. At each instruction, if the current value
14856 of this symbol is less than or equal to the maximum VGPR number explicitly
14857 referenced within that instruction then the symbol value is updated to equal
14858 that VGPR number plus one.
14859
14860 May be used to set the ``.amdhsa_next_free_vgpr`` directive in
14861 :ref:`amdhsa-kernel-directives-table`.
14862
14863 May be set at any time, e.g. manually set to zero at the start of each kernel.
14864
14865 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
14866
14867 .amdgcn.next_free_sgpr
14868 ++++++++++++++++++++++
14869
14870 Set to zero before assembly begins. At each instruction, if the current value
14871 of this symbol is less than or equal to the maximum SGPR number explicitly
14872 referenced within that instruction then the symbol value is updated to equal
14873 that SGPR number plus one.
14874
14875 May be used to set the ``.amdhsa_next_free_sgpr`` directive in
14876 :ref:`amdhsa-kernel-directives-table`.
14877
14878 May be set at any time, e.g. manually set to zero at the start of each kernel.
14879
14880 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
14881
14882 Code Object V3 and Above Directives
14883 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14884
14885 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
14886 architecture processors, and are not OS-specific. Directives which begin with
14887 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
14888 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
14889 :ref:`amdgpu-processors`.
14890
14891 .. _amdgpu-assembler-directive-amdgcn-target:
14892
14893 .amdgcn_target <target-triple> "-" <target-id>
14894 ++++++++++++++++++++++++++++++++++++++++++++++
14895
14896 Optional directive which declares the ``<target-triple>-<target-id>`` supported
14897 by the containing assembler source file. Used by the assembler to validate
14898 command-line options such as ``-triple``, ``-mcpu``, and
14899 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
14900 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
14901
14902 .. note::
14903
14904   The target ID syntax used for code object V2 to V3 for this directive differs
14905   from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
14906
14907 .amdhsa_kernel <name>
14908 +++++++++++++++++++++
14909
14910 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
14911 ``<name>.kd``, in the current location of the current section. Only valid when
14912 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
14913 instruction to execute, and does not need to be previously defined.
14914
14915 Marks the beginning of a list of directives used to generate the bytes of a
14916 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
14917 Directives which may appear in this list are described in
14918 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
14919 be valid for the target being assembled for, and cannot be repeated. Directives
14920 support the range of values specified by the field they reference in
14921 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
14922 assumed to have its default value, unless it is marked as "Required", in which
14923 case it is an error to omit the directive. This list of directives is
14924 terminated by an ``.end_amdhsa_kernel`` directive.
14925
14926   .. table:: AMDHSA Kernel Assembler Directives
14927      :name: amdhsa-kernel-directives-table
14928
14929      ======================================================== =================== ============ ===================
14930      Directive                                                Default             Supported On Description
14931      ======================================================== =================== ============ ===================
14932      ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX11   Controls GROUP_SEGMENT_FIXED_SIZE in
14933                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14934      ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX11   Controls PRIVATE_SEGMENT_FIXED_SIZE in
14935                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14936      ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX11   Controls KERNARG_SIZE in
14937                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14938      ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX11   Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
14939                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14940      ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
14941                                                                                   (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14942                                                                                   GFX940)
14943      ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_PTR in
14944                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14945      ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX11   Controls ENABLE_SGPR_QUEUE_PTR in
14946                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14947      ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX11   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
14948                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14949      ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_ID in
14950                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14951      ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
14952                                                                                   (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14953                                                                                   GFX940)
14954      ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX11   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
14955                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14956      ``.amdhsa_wavefront_size32``                             Target              GFX10-GFX11  Controls ENABLE_WAVEFRONT_SIZE32 in
14957                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14958                                                               Specific
14959                                                               (wavefrontsize64)
14960      ``.amdhsa_uses_dynamic_stack``                           0                   GFX6-GFX11   Controls USES_DYNAMIC_STACK in
14961                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14962      ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
14963                                                                                   (except      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14964                                                                                   GFX940)
14965      ``.amdhsa_enable_private_segment``                       0                   GFX940,      Controls ENABLE_PRIVATE_SEGMENT in
14966                                                                                   GFX11        :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14967      ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_X in
14968                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14969      ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
14970                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14971      ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
14972                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14973      ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_INFO in
14974                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14975      ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX11   Controls ENABLE_VGPR_WORKITEM_ID in
14976                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14977                                                                                                Possible values are defined in
14978                                                                                                :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
14979      ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX11   Maximum VGPR number explicitly referenced, plus one.
14980                                                                                                Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
14981                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14982      ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX11   Maximum SGPR number explicitly referenced, plus one.
14983                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14984                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14985      ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of the first AccVGPR in the unified register file.
14986                                                                                   GFX940       Used to calculate ACCUM_OFFSET in
14987                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14988      ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX11   Whether the kernel may use the special VCC SGPR.
14989                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14990                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14991      ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
14992                                                                                   (except      scratch memory. Used to calculate
14993                                                                                   GFX940)      GRANULATED_WAVEFRONT_SGPR_COUNT in
14994                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14995      ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
14996                                                               Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14997                                                               Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14998                                                               (xnack)
14999      ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_32 in
15000                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15001                                                                                                Possible values are defined in
15002                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15003      ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_16_64 in
15004                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15005                                                                                                Possible values are defined in
15006                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15007      ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_32 in
15008                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15009                                                                                                Possible values are defined in
15010                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15011      ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_16_64 in
15012                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15013                                                                                                Possible values are defined in
15014                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15015      ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX11   Controls ENABLE_DX10_CLAMP in
15016                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15017      ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX11   Controls ENABLE_IEEE_MODE in
15018                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15019      ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX11   Controls FP16_OVFL in
15020                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15021      ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in
15022                                                               Feature             GFX940,      :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15023                                                               Specific            GFX11
15024                                                               (tgsplit)
15025      ``.amdhsa_workgroup_processor_mode``                     Target              GFX10-GFX11  Controls ENABLE_WGP_MODE in
15026                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15027                                                               Specific
15028                                                               (cumode)
15029      ``.amdhsa_memory_ordered``                               1                   GFX10-GFX11  Controls MEM_ORDERED in
15030                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15031      ``.amdhsa_forward_progress``                             0                   GFX10-GFX11  Controls FWD_PROGRESS in
15032                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15033      ``.amdhsa_shared_vgpr_count``                            0                   GFX10-GFX11  Controls SHARED_VGPR_COUNT in
15034                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
15035      ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
15036                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15037      ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
15038                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15039      ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
15040                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15041      ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
15042                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15043      ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
15044                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15045      ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
15046                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15047      ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
15048                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15049      ======================================================== =================== ============ ===================
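
The rules for the directive list (any order, no repeats, required directives
present, defaults for the rest) can be illustrated with a small Python check.
This is a sketch and not how the assembler is implemented:
``resolve_directives`` is a hypothetical helper, and ``DEFAULTS`` copies only a
handful of rows from the table above, with ``None`` standing for "Required".
Checks for target validity and value ranges are omitted.

.. code:: python

   # A few rows from the directive table; None marks a "Required" directive.
   DEFAULTS = {
       ".amdhsa_group_segment_fixed_size": 0,
       ".amdhsa_user_sgpr_kernarg_segment_ptr": 0,
       ".amdhsa_system_sgpr_workgroup_id_x": 1,
       ".amdhsa_next_free_vgpr": None,
       ".amdhsa_next_free_sgpr": None,
   }

   def resolve_directives(pairs):
       """Map each known directive to its value, applying the list rules."""
       seen = {}
       for name, value in pairs:
           if name not in DEFAULTS:
               raise ValueError(f"unknown directive {name}")
           if name in seen:
               raise ValueError(f"{name} cannot be repeated")
           seen[name] = value
       resolved = {}
       for name, default in DEFAULTS.items():
           if name in seen:
               resolved[name] = seen[name]
           elif default is None:
               raise ValueError(f"{name} is required")
           else:
               resolved[name] = default
       return resolved

   # Directives may be given in any order.
   kernel = resolve_directives([
       (".amdhsa_next_free_sgpr", 2),
       (".amdhsa_user_sgpr_kernarg_segment_ptr", 1),
       (".amdhsa_next_free_vgpr", 3),
   ])
   print(kernel[".amdhsa_system_sgpr_workgroup_id_x"])

Omitting ``.amdhsa_next_free_vgpr`` or listing any directive twice raises an
error in this model, mirroring the behavior described above.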
15050
15051 .amdgpu_metadata
15052 ++++++++++++++++
15053
15054 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15055 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15056
15057 The contents must be in the [YAML]_ markup format, with the same structure and
15058 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15059 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15060 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15061
15062 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
15063
15064 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15065
15066 Code Object V3 and Above Example Source Code
15067 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15068
15069 Here is an example of a minimal assembly source file, defining one HSA kernel:
15070
15071 .. code::
15072    :number-lines:
15073
15074    .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15075
15076    .text
15077    .globl hello_world
15078    .p2align 8
15079    .type hello_world,@function
15080    hello_world:
15081      s_load_dwordx2 s[0:1], s[0:1], 0x0
15082      v_mov_b32 v0, 3.14159
15083      s_waitcnt lgkmcnt(0)
15084      v_mov_b32 v1, s0
15085      v_mov_b32 v2, s1
15086      flat_store_dword v[1:2], v0
15087      s_endpgm
15088    .Lfunc_end0:
15089      .size   hello_world, .Lfunc_end0-hello_world
15090
15091    .rodata
15092    .p2align 6
15093    .amdhsa_kernel hello_world
15094      .amdhsa_user_sgpr_kernarg_segment_ptr 1
15095      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15096      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15097    .end_amdhsa_kernel
15098
15099    .amdgpu_metadata
15100    ---
15101    amdhsa.version:
15102      - 1
15103      - 0
15104    amdhsa.kernels:
15105      - .name: hello_world
15106        .symbol: hello_world.kd
15107        .kernarg_segment_size: 48
15108        .group_segment_fixed_size: 0
15109        .private_segment_fixed_size: 0
15110        .kernarg_segment_align: 4
15111        .wavefront_size: 64
15112        .sgpr_count: 2
15113        .vgpr_count: 3
15114        .max_flat_workgroup_size: 256
15115        .args:
15116          - .size: 8
15117            .offset: 0
15118            .value_kind: global_buffer
15119            .address_space: global
15120            .actual_access: write_only
15121    //...
15122    .end_amdgpu_metadata
15123
15124 This kernel is equivalent to the following HIP program:
15125
15126 .. code::
15127    :number-lines:
15128
15129    __global__ void hello_world(float *p) {
15130        *p = 3.14159f;
15131    }
15132
15133 If an assembly source file contains multiple kernels and/or functions, the
15134 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15135 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15136 the ``.set <symbol>, <expression>`` directive. For example, with two kernels
15137 where ``func1`` is called only from ``kern1``, it is sufficient to group the
15138 function with the kernel that calls it and reset the symbols between the two
15139 connected components:
15140
15141 .. code::
15142    :number-lines:
15143
15144    .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15145
15146    // gpr tracking symbols are implicitly set to zero
15147
15148    .text
15149    .globl kern0
15150    .p2align 8
15151    .type kern0,@function
15152    kern0:
15153      // ...
15154      s_endpgm
15155    .Lkern0_end:
15156      .size   kern0, .Lkern0_end-kern0
15157
15158    .rodata
15159    .p2align 6
15160    .amdhsa_kernel kern0
15161      // ...
15162      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15163      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15164    .end_amdhsa_kernel
15165
15166    // reset symbols to begin tracking usage in func1 and kern1
15167    .set .amdgcn.next_free_vgpr, 0
15168    .set .amdgcn.next_free_sgpr, 0
15169
15170    .text
15171    .hidden func1
15172    .global func1
15173    .p2align 2
15174    .type func1,@function
15175    func1:
15176      // ...
15177      s_setpc_b64 s[30:31]
15178    .Lfunc1_end:
15179      .size   func1, .Lfunc1_end-func1
15180
15181    .globl kern1
15182    .p2align 8
15183    .type kern1,@function
15184    kern1:
15185      // ...
15186      s_getpc_b64 s[4:5]
15187      s_add_u32 s4, s4, func1@rel32@lo+4
15188      s_addc_u32 s5, s5, func1@rel32@hi+12
15189      s_swappc_b64 s[30:31], s[4:5]
15190      // ...
15191      s_endpgm
15192    .Lkern1_end:
15193      .size   kern1, .Lkern1_end-kern1
15194
15195    .rodata
15196    .p2align 6
15197    .amdhsa_kernel kern1
15198      // ...
15199      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15200      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15201    .end_amdhsa_kernel
15202
15203 The assembler cannot identify connected components itself, so these symbols do
15204 not automatically track usage per kernel. However, careful organization of the
15205 kernels and functions in the source file can often keep the additional effort
15206 needed to calculate GPR usage accurately to a minimum.
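
The effect of resetting the symbols between connected components can be
illustrated with a toy Python model. It is a sketch under simplifying
assumptions: each component is reduced to the highest VGPR number it
references, and ``vgpr_counts`` is a hypothetical helper, not part of any LLVM
tool.

.. code:: python

   def vgpr_counts(components, reset_between):
       """Return the tracked symbol value read at the end of each component.

       components: list of (name, highest VGPR number referenced).
       reset_between: model a ".set .amdgcn.next_free_vgpr, 0" between them.
       """
       counts = {}
       symbol = 0  # implicitly zero before assembly begins
       for name, highest in components:
           if symbol <= highest:
               symbol = highest + 1
           counts[name] = symbol
           if reset_between:
               symbol = 0
       return counts

   # Hypothetical layout: kern0 touches v0-v9, func1 plus kern1 only v0-v3.
   layout = [("kern0", 9), ("func1+kern1", 3)]
   print(vgpr_counts(layout, reset_between=False))
   print(vgpr_counts(layout, reset_between=True))

Without the reset, the second component inherits the larger count left behind
by ``kern0``; with the reset, it reports only its own usage.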
15207
15208 Additional Documentation
15209 ========================
15210
15211 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
15212 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15213 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15214 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15215 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15216 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15217 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15218 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15219 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15220 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
15221 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15222 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15223 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15224 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15225 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
15226 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
15227 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15228 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15229 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15230 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15231 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15232 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15233 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15234 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15235 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15236 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__