llvm/docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 .. toctree::
   9    :hidden:
  10
  11    AMDGPU/AMDGPUAsmGFX7
  12    AMDGPU/AMDGPUAsmGFX8
  13    AMDGPU/AMDGPUAsmGFX9
  14    AMDGPU/AMDGPUAsmGFX900
  15    AMDGPU/AMDGPUAsmGFX904
  16    AMDGPU/AMDGPUAsmGFX906
  17    AMDGPU/AMDGPUAsmGFX908
  18    AMDGPU/AMDGPUAsmGFX90a
  19    AMDGPU/AMDGPUAsmGFX940
  20    AMDGPU/AMDGPUAsmGFX10
  21    AMDGPU/AMDGPUAsmGFX1011
  22    AMDGPU/AMDGPUAsmGFX1013
  23    AMDGPU/AMDGPUAsmGFX1030
  24    AMDGPU/AMDGPUAsmGFX11
  25    AMDGPUModifierSyntax
  26    AMDGPUOperandSyntax
  27    AMDGPUInstructionSyntax
  28    AMDGPUInstructionNotation
  29    AMDGPUDwarfExtensionsForHeterogeneousDebugging
  30    AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
  31
  32 Introduction
  33 ============
  34
  35 The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
  36 family through the current GCN families. It lives in the
  37 ``llvm/lib/Target/AMDGPU`` directory.
  38
  39 LLVM
  40 ====
  41
  42 .. _amdgpu-target-triples:
  43
  44 Target Triples
  45 --------------
  46
  47 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
  48 to specify the target triple:
  49
  50   .. table:: AMDGPU Architectures
  51      :name: amdgpu-architecture-table
  52
  53      ============ ==============================================================
  54      Architecture Description
  55      ============ ==============================================================
  56      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  57      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
  58      ============ ==============================================================
  59
  60   .. table:: AMDGPU Vendors
  61      :name: amdgpu-vendor-table
  62
  63      ============ ==============================================================
  64      Vendor       Description
  65      ============ ==============================================================
  66      ``amd``      Can be used for all AMD GPU usage.
  67      ``mesa3d``   Can be used if the OS is ``mesa3d``.
  68      ============ ==============================================================
  69
  70   .. table:: AMDGPU Operating Systems
  71      :name: amdgpu-os
  72
  73      ============== ============================================================
  74      OS             Description
  75      ============== ============================================================
  76      *<empty>*      Defaults to the *unknown* OS.
  77      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
  78                     such as:
  79
  80                     - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
  81                       loader on Linux. See *AMD ROCm Platform Release Notes*
  82                       [AMD-ROCm-Release-Notes]_ for supported hardware and
  83                       software.
  84                     - AMD's PAL runtime using the *pal-amdhsa* loader on
  85                       Windows.
  86
  87      ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
  88                     runtime using the *pal-amdpal* loader on Windows and Linux
  89                     Pro.
  90      ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
  91                     3D runtime using the *mesa-mesa3d* loader on Linux.
  92      ============== ============================================================
  93
  94   .. table:: AMDGPU Environments
  95      :name: amdgpu-environment-table
  96
  97      ============ ==============================================================
  98      Environment  Description
  99      ============ ==============================================================
 100      *<empty>*    Default.
 101      ============ ==============================================================
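
As a hedged illustration of how the components in the tables above combine, the
following Python sketch assembles a triple and the matching ``-target``
arguments. The helper names (``make_triple``, ``clang_target_args``) are
hypothetical and not part of LLVM or Clang, and dropping trailing empty
components is an assumption of this sketch rather than a documented rule.

```python
# Allowed values, copied from the tables above. An empty string stands for
# the *<empty>* table entries.
ARCHITECTURES = ("r600", "amdgcn")
VENDORS = ("amd", "mesa3d")
OPERATING_SYSTEMS = ("", "amdhsa", "amdpal", "mesa3d")
ENVIRONMENTS = ("",)


def make_triple(arch, vendor="amd", os_name="", environment=""):
    """Join validated components as <Architecture>-<Vendor>-<OS>-<Environment>."""
    for value, allowed, kind in (
        (arch, ARCHITECTURES, "architecture"),
        (vendor, VENDORS, "vendor"),
        (os_name, OPERATING_SYSTEMS, "OS"),
        (environment, ENVIRONMENTS, "environment"),
    ):
        if value not in allowed:
            raise ValueError(f"unknown {kind}: {value!r}")
    # The vendor table lists mesa3d only for use with the mesa3d OS.
    if vendor == "mesa3d" and os_name != "mesa3d":
        raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
    parts = [arch, vendor, os_name, environment]
    # Assumption of this sketch: omit trailing empty components.
    while parts and parts[-1] == "":
        parts.pop()
    return "-".join(parts)


def clang_target_args(arch, vendor="amd", os_name="", environment=""):
    """Return the Clang arguments that select the triple (hypothetical helper)."""
    return ["-target", make_triple(arch, vendor, os_name, environment)]
```

For example, ``clang_target_args("amdgcn", "amd", "amdhsa")`` yields
``["-target", "amdgcn-amd-amdhsa"]``.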
 102
 103 .. _amdgpu-processors:
 104
 105 Processors
 106 ----------
 107
 108 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
 109 specify the AMDGPU processor together with optional target features. See
 110 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
 111 target-specific information.
 112
 113 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
 114
 115 * ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
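
The rule above can be sketched in Python as a small check performed before
emitting a ``-mcpu`` option. The ``PROCESSOR_ARCH`` mapping is a hand-copied
excerpt of a few rows of the AMDGPU Processors table, and the function names are
hypothetical. The sketch passes a bare processor name as the target ID and does
not model target features.

```python
# Excerpt of the AMDGPU Processors table: processor -> target triple
# architecture. Only a few rows are copied here for illustration.
PROCESSOR_ARCH = {
    "r600": "r600",
    "rv770": "r600",
    "cypress": "r600",
    "cayman": "r600",
    "gfx600": "amdgcn",
    "gfx803": "amdgcn",
    "gfx900": "amdgcn",
    "gfx90a": "amdgcn",
    "gfx1030": "amdgcn",
}


def os_is_supported(processor, os_name):
    """Apply the single documented exception: no amdhsa on r600."""
    arch = PROCESSOR_ARCH[processor]
    return not (arch == "r600" and os_name == "amdhsa")


def mcpu_option(processor, os_name):
    """Return a -mcpu option after checking the processor/OS combination."""
    if processor not in PROCESSOR_ARCH:
        raise ValueError(f"processor not in this excerpt: {processor!r}")
    if not os_is_supported(processor, os_name):
        raise ValueError(f"{os_name!r} is not supported on {processor!r}")
    return f"-mcpu={processor}"
```

With these definitions, ``mcpu_option("gfx900", "amdhsa")`` returns
``-mcpu=gfx900``, while ``mcpu_option("cayman", "amdhsa")`` raises
``ValueError``.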
 116
 117
 118   .. table:: AMDGPU Processors
 119      :name: amdgpu-processor-table
 120
 121      =========== =============== ============ ===== ================= =============== =============== ======================
 122      Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
 123                  Processor       Triple       APU   Features          Properties      *(see*          Products
 124                                  Architecture       Supported                         `amdgpu-os`_
 125                                                                                       *and
 126                                                                                       corresponding
 127                                                                                       runtime release
 128                                                                                       notes for
 129                                                                                       current
 130                                                                                       information and
 131                                                                                       level of
 132                                                                                       support)*
 133      =========== =============== ============ ===== ================= =============== =============== ======================
 134      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
 135      -----------------------------------------------------------------------------------------------------------------------
 136      ``r600``                    ``r600``     dGPU                    - Does not
 137                                                                         support
 138                                                                         generic
 139                                                                         address
 140                                                                         space
 141      ``r630``                    ``r600``     dGPU                    - Does not
 142                                                                         support
 143                                                                         generic
 144                                                                         address
 145                                                                         space
 146      ``rs880``                   ``r600``     dGPU                    - Does not
 147                                                                         support
 148                                                                         generic
 149                                                                         address
 150                                                                         space
 151      ``rv670``                   ``r600``     dGPU                    - Does not
 152                                                                         support
 153                                                                         generic
 154                                                                         address
 155                                                                         space
 156      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
 157      -----------------------------------------------------------------------------------------------------------------------
 158      ``rv710``                   ``r600``     dGPU                    - Does not
 159                                                                         support
 160                                                                         generic
 161                                                                         address
 162                                                                         space
 163      ``rv730``                   ``r600``     dGPU                    - Does not
 164                                                                         support
 165                                                                         generic
 166                                                                         address
 167                                                                         space
 168      ``rv770``                   ``r600``     dGPU                    - Does not
 169                                                                         support
 170                                                                         generic
 171                                                                         address
 172                                                                         space
 173      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
 174      -----------------------------------------------------------------------------------------------------------------------
 175      ``cedar``                   ``r600``     dGPU                    - Does not
 176                                                                         support
 177                                                                         generic
 178                                                                         address
 179                                                                         space
 180      ``cypress``                 ``r600``     dGPU                    - Does not
 181                                                                         support
 182                                                                         generic
 183                                                                         address
 184                                                                         space
 185      ``juniper``                 ``r600``     dGPU                    - Does not
 186                                                                         support
 187                                                                         generic
 188                                                                         address
 189                                                                         space
 190      ``redwood``                 ``r600``     dGPU                    - Does not
 191                                                                         support
 192                                                                         generic
 193                                                                         address
 194                                                                         space
 195      ``sumo``                    ``r600``     dGPU                    - Does not
 196                                                                         support
 197                                                                         generic
 198                                                                         address
 199                                                                         space
 200      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 201      -----------------------------------------------------------------------------------------------------------------------
 202      ``barts``                   ``r600``     dGPU                    - Does not
 203                                                                         support
 204                                                                         generic
 205                                                                         address
 206                                                                         space
 207      ``caicos``                  ``r600``     dGPU                    - Does not
 208                                                                         support
 209                                                                         generic
 210                                                                         address
 211                                                                         space
 212      ``cayman``                  ``r600``     dGPU                    - Does not
 213                                                                         support
 214                                                                         generic
 215                                                                         address
 216                                                                         space
 217      ``turks``                   ``r600``     dGPU                    - Does not
 218                                                                         support
 219                                                                         generic
 220                                                                         address
 221                                                                         space
 222      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 223      -----------------------------------------------------------------------------------------------------------------------
 224      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 225                                                                         support
 226                                                                         generic
 227                                                                         address
 228                                                                         space
 229      ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 230                  - ``verde``                                            support
 231                                                                         generic
 232                                                                         address
 233                                                                         space
 234      ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 235                  - ``oland``                                            support
 236                                                                         generic
 237                                                                         address
 238                                                                         space
 239      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 240      -----------------------------------------------------------------------------------------------------------------------
 241      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
 242                                                                         flat          - *pal-amdhsa*  - A6 Pro-7050B
 243                                                                         scratch       - *pal-amdpal*  - A8-7100
 244                                                                                                       - A8 Pro-7150B
 245                                                                                                       - A10-7300
 246                                                                                                       - A10 Pro-7350B
 247                                                                                                       - FX-7500
 248                                                                                                       - A8-7200P
 249                                                                                                       - A10-7400P
 250                                                                                                       - FX-7600P
 251      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
 252                                                                         flat          - *pal-amdhsa*  - FirePro W9100
 253                                                                         scratch       - *pal-amdpal*  - FirePro S9150
 254                                                                                                       - FirePro S9170
 255      ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
 256                                                                         flat          - *pal-amdhsa*  - Radeon R9 290x
 257                                                                         scratch       - *pal-amdpal*  - Radeon R9 390
 258                                                                                                       - Radeon R9 390X
 259      ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
 260                  - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
 261                                                                         scratch                       - E1-2500
 262                                                                                                       - E2-3000
 263                                                                                                       - E2-3800
 264                                                                                                       - A4-5000
 265                                                                                                       - A4-5100
 266                                                                                                       - A6-5200
 267                                                                                                       - A4 Pro-3340B
 268      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
 269                                                                         flat          - *pal-amdpal*  - Radeon HD 8770
 270                                                                         scratch                       - R7 260
 271                                                                                                       - R7 260X
 272      ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
 273                                                                         flat          - *pal-amdpal*
 274                                                                         scratch                       .. TODO::
 275
 276                                                                                                         Add product
 277                                                                                                         names.
 278
 279      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 280      -----------------------------------------------------------------------------------------------------------------------
 281      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
 282                                                                         flat          - *pal-amdhsa*  - Pro A6-8500B
 283                                                                         scratch       - *pal-amdpal*  - A8-8600P
 284                                                                                                       - Pro A8-8600B
 285                                                                                                       - FX-8800P
 286                                                                                                       - Pro A12-8800B
 287                                                                                                       - A10-8700P
 288                                                                                                       - Pro A10-8700B
 289                                                                                                       - A10-8780P
 290                                                                                                       - A10-9600P
 291                                                                                                       - A10-9630P
 292                                                                                                       - A12-9700P
 293                                                                                                       - A12-9730P
 294                                                                                                       - FX-9800P
 295                                                                                                       - FX-9830P
 296                                                                                                       - E2-9010
 297                                                                                                       - A6-9210
 298                                                                                                       - A9-9410
 299      ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
 300                  - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
 301                                                                         scratch       - *pal-amdpal*  - Radeon R9 385
 302      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
 303                                                                                       - *pal-amdhsa*  - Radeon R9 Fury
 304                                                                                       - *pal-amdpal*  - Radeon R9 FuryX
 305                                                                                                       - Radeon Pro Duo
 306                                                                                                       - FirePro S9300x2
 307                                                                                                       - Radeon Instinct MI8
 308      \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
 309                                                                         flat          - *pal-amdhsa*  - Radeon RX 480
 310                                                                         scratch       - *pal-amdpal*  - Radeon Instinct MI6
 311      \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
 312                                                                         flat          - *pal-amdhsa*
 313                                                                         scratch       - *pal-amdpal*
 314      ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
 315                                                                         flat          - *pal-amdhsa*  - FirePro S7100
 316                                                                         scratch       - *pal-amdpal*  - FirePro W7100
 317                                                                                                       - Mobile FirePro
 318                                                                                                         M7170
 319      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
 320                                                                         flat          - *pal-amdhsa*
 321                                                                         scratch       - *pal-amdpal*  .. TODO::
 322
 323                                                                                                         Add product
 324                                                                                                         names.
 325
 326      **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
 327      -----------------------------------------------------------------------------------------------------------------------
 328      ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
 329                                                                         flat          - *pal-amdhsa*    Frontier Edition
 330                                                                         scratch       - *pal-amdpal*  - Radeon RX Vega 56
 331                                                                                                       - Radeon RX Vega 64
 332                                                                                                       - Radeon RX Vega 64
 333                                                                                                         Liquid
 334                                                                                                       - Radeon Instinct MI25
 335      ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
 336                                                                         flat          - *pal-amdhsa*  - Ryzen 5 2400G
 337                                                                         scratch       - *pal-amdpal*
 338      ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
 339                                                                                       - *pal-amdhsa*
 340                                                                                       - *pal-amdpal*  .. TODO::
 341
 342                                                                                                         Add product
 343                                                                                                         names.
 344
 345      ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
 346                                                     - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
 347                                                                         scratch       - *pal-amdpal*  - Radeon VII
 348                                                                                                       - Radeon Pro VII
 349      ``gfx908``                  ``amdgcn``   dGPU  - sramecc                         - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
 350                                                     - xnack           - Absolute
 351                                                                         flat
 352                                                                         scratch
 353      ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
 354                                                                         flat
 355                                                                         scratch                       .. TODO::
 356
 357                                                                                                         Add product
 358                                                                                                         names.
 359
 360      ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
 361                                                     - tgsplit           flat
 362                                                     - xnack             scratch                       .. TODO::
 363                                                                       - Packed
 364                                                                         work-item                       Add product
 365                                                                         IDs                             names.
 366
 367      ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
 368                                                                         flat                          - Ryzen 7 4700GE
 369                                                                         scratch                       - Ryzen 5 4600G
 370                                                                                                       - Ryzen 5 4600GE
 371                                                                                                       - Ryzen 3 4300G
 372                                                                                                       - Ryzen 3 4300GE
 373                                                                                                       - Ryzen Pro 4000G
 374                                                                                                       - Ryzen 7 Pro 4700G
 375                                                                                                       - Ryzen 7 Pro 4750GE
 376                                                                                                       - Ryzen 5 Pro 4650G
 377                                                                                                       - Ryzen 5 Pro 4650GE
 378                                                                                                       - Ryzen 3 Pro 4350G
 379                                                                                                       - Ryzen 3 Pro 4350GE
 380
 381      ``gfx940``                  ``amdgcn``   dGPU  - sramecc         - Architected                   *TBA*
 382                                                     - tgsplit           flat
 383                                                     - xnack             scratch                       .. TODO::
 384                                                                       - Packed
 385                                                                         work-item                       Add product
 386                                                                         IDs                             names.
 387
 388      **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
 389      -----------------------------------------------------------------------------------------------------------------------
 390      ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
 391                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
 392                                                     - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
 393                                                                                                       - Radeon Pro 5600M
 394      ``gfx1011``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon Pro V520
 395                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 396                                                     - xnack             scratch       - *pal-amdpal*
 397
 398      ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
 399                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
 400                                                     - xnack             scratch       - *pal-amdpal*
 401      ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
 402                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 403                                                     - xnack             scratch       - *pal-amdpal*  .. TODO::
 404
 405                                                                                                         Add product
 406                                                                                                         names.
 407
 408      **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
 409      -----------------------------------------------------------------------------------------------------------------------
 410      ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
 411                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
 412                                                                         scratch       - *pal-amdpal*  - Radeon RX 6900 XT
 413      ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
 414                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 415                                                                         scratch       - *pal-amdpal*
 416      ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
 417                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 418                                                                         scratch       - *pal-amdpal*  .. TODO::
 419
 420                                                                                                         Add product
 421                                                                                                         names.
 422
 423      ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 424                                                     - wavefrontsize64   flat
 425                                                                         scratch                       .. TODO::
 426
 427                                                                                                         Add product
 428                                                                                                         names.
 429      ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
 430                                                     - wavefrontsize64   flat
 431                                                                         scratch                       .. TODO::
 432
 433                                                                                                         Add product
 434                                                                                                         names.
 435
 436      ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 437                                                     - wavefrontsize64   flat
 438                                                                         scratch                       .. TODO::
 439                                                                                                         Add product
 440                                                                                                         names.
 441
 442      ``gfx1036``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 443                                                     - wavefrontsize64   flat
 444                                                                         scratch                       .. TODO::
 445
 446                                                                                                         Add product
 447                                                                                                         names.
 448
 449      **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
 450      -----------------------------------------------------------------------------------------------------------------------
 451      ``gfx1100``                 ``amdgcn``   dGPU  - cumode          - Architected   - *pal-amdpal*  *TBA*
 452                                                     - wavefrontsize64   flat
 453                                                                         scratch                       .. TODO::
 454                                                                       - Packed
 455                                                                         work-item                       Add product
 456                                                                         IDs                             names.
 457
 458      ``gfx1101``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
 459                                                     - wavefrontsize64   flat
 460                                                                         scratch                       .. TODO::
 461                                                                       - Packed
 462                                                                         work-item                       Add product
 463                                                                         IDs                             names.
 464
 465      ``gfx1102``                 ``amdgcn``   dGPU  - cumode          - Architected                   *TBA*
 466                                                     - wavefrontsize64   flat
 467                                                                         scratch                       .. TODO::
 468                                                                       - Packed
 469                                                                         work-item                       Add product
 470                                                                         IDs                             names.
 471
 472      ``gfx1103``                 ``amdgcn``   APU   - cumode          - Architected                   *TBA*
 473                                                     - wavefrontsize64   flat
 474                                                                         scratch                       .. TODO::
 475                                                                       - Packed
 476                                                                         work-item                       Add product
 477                                                                         IDs                             names.
 478
 479      =========== =============== ============ ===== ================= =============== =============== ======================
 480
 481 .. _amdgpu-target-features:
 482
 483 Target Features
 484 ---------------
 485
 486 Target features control how code is generated to support certain
 487 processor-specific features. Not all target features are supported by
 488 all processors. The runtime must ensure that the features supported by
 489 the device used to execute the code match the features enabled when
 490 generating the code. A mismatch of features may result in incorrect
 491 execution, or a reduction in performance.
 492
 493 The target features supported by each processor are listed in
 494 :ref:`amdgpu-processor-table`.
 495
 496 Target features are controlled by exactly one of the following Clang
 497 options:
 498
 499 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
 500
 501   The ``-mcpu`` and ``--offload-arch`` options can specify target features as
 502   optional components of the target ID. If omitted, the target feature has the
 503   ``any`` value. See :ref:`amdgpu-target-id`.
 504
 505 ``-m[no-]<target-feature>``
 506
 507   Target features not specified by the target ID are specified using a
 508   separate option. These target features can have an ``on`` or ``off``
 509   value.  ``on`` is specified by omitting the ``no-`` prefix, and
 510   ``off`` is specified by including the ``no-`` prefix. The default
 511   if not specified is ``off``.
 512
 513 For example:
 514
 515 ``-mcpu=gfx908:xnack+``
 516   Enable the ``xnack`` feature.
 517 ``-mcpu=gfx908:xnack-``
 518   Disable the ``xnack`` feature.
 519 ``-mcumode``
 520   Enable the ``cumode`` feature.
 521 ``-mno-cumode``
 522   Disable the ``cumode`` feature.
 523
 524   .. table:: AMDGPU Target Features
 525      :name: amdgpu-target-features-table
 526
 527      =============== ============================ ==================================================
 528      Target Feature  Clang Option to Control      Description
 529      Name
 530      =============== ============================ ==================================================
 531      cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
 532                                                   when generating code for kernels. When disabled,
 533                                                   native WGP wavefront execution mode is used;
 534                                                   when enabled, CU wavefront execution mode is used
 535                                                   (see :ref:`amdgpu-amdhsa-memory-model`).
 536
 537      sramecc         - ``-mcpu``                  If specified, generate code that can only be
 538                      - ``--offload-arch``         loaded and executed in a process that has a
 539                                                   matching setting for SRAMECC.
 540
 541                                                   If not specified for code object V2 to V3, generate
 542                                                   code that can be loaded and executed in a process
 543                                                   with SRAMECC enabled.
 544
 545                                                   If not specified for code object V4 or above, generate
 546                                                   code that can be loaded and executed in a process
 547                                                   with either setting of SRAMECC.
 548
 549      tgsplit         - ``-m[no-]tgsplit``         Enable/disable generating code that assumes
 550                                                   work-groups are launched in threadgroup split mode.
 551                                                   When enabled the waves of a work-group may be
 552                                                   launched in different CUs.
 553
 554      wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
 555                                                   generating code for kernels. When disabled,
 556                                                   native wavefront size 32 is used; when enabled,
 557                                                   wavefront size 64 is used.
 558
 559      xnack           - ``-mcpu``                  If specified, generate code that can only be
 560                      - ``--offload-arch``         loaded and executed in a process that has a
 561                                                   matching setting for XNACK replay.
 562
 563                                                   If not specified for code object V2 to V3, generate
 564                                                   code that can be loaded and executed in a process
 565                                                   with XNACK replay enabled.
 566
 567                                                   If not specified for code object V4 or above, generate
 568                                                   code that can be loaded and executed in a process
 569                                                   with either setting of XNACK replay.
 570
 571                                                   XNACK replay can be used for demand paging and
 572                                                   page migration. If enabled in the device, then if
 573                                                   a page fault occurs the code may execute
 574                                                   incorrectly unless generated with XNACK replay
 575                                                   enabled, or generated for code object V4 or above without
 576                                                   specifying XNACK replay. Executing code that was
 577                                                   generated with XNACK replay enabled, or generated
 578                                                   for code object V4 or above without specifying XNACK replay,
 579                                                   on a device that does not have XNACK replay
 580                                                   enabled will execute correctly but may be less
 581                                                   performant than code generated for XNACK replay
 582                                                   disabled.
 583      =============== ============================ ==================================================
 584
 585 .. _amdgpu-target-id:
 586
 587 Target ID
 588 ---------
 589
 590 AMDGPU supports target IDs. See `Clang Offload Bundler
 591 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
 592 description. The AMDGPU target specific information is:
 593
 594 **processor**
 595   Is an AMDGPU processor or alternative processor name specified in
 596   :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
 597   the primary processor and alternative processor names. The canonical form
 598   target ID only allows the primary processor name.
 599
 600 **target-feature**
 601   Is a target feature name specified in :ref:`amdgpu-target-features-table` that
 602   is supported by the processor. The target features supported by each processor
 603   are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
 604   a target ID are marked as being controlled by ``-mcpu`` and
 605   ``--offload-arch``. Each target feature must appear at most once in a target
 606   ID. The non-canonical form target ID allows the target features to be
 607   specified in any order. The canonical form target ID requires the target
 608   features to be specified in alphabetic order.
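
The ordering and uniqueness rules above can be illustrated with a minimal
Python sketch. It assumes the ``<processor>(:<target-feature>(+|-))*`` spelling
used in the ``-mcpu`` examples earlier in this document. The helper name
``canonical_target_id`` is hypothetical and is not part of any LLVM or Clang
API. The sketch does not map alternative processor names to the primary name
and does not check that the processor supports each feature, since both
require the processor table.

```python
def canonical_target_id(target_id: str) -> str:
    """Return target_id with its target features in alphabetic order.

    Hypothetical helper: raises ValueError if a feature is malformed or
    appears more than once.
    """
    processor, *features = target_id.split(":")
    if not processor:
        raise ValueError("missing processor name")

    settings = {}
    for feature in features:
        # Each feature is a name followed by "+" (on) or "-" (off).
        name, setting = feature[:-1], feature[-1:]
        if not name or setting not in ("+", "-"):
            raise ValueError(f"malformed target feature: {feature!r}")
        if name in settings:
            raise ValueError(f"target feature given more than once: {name}")
        settings[name] = setting

    # The canonical form lists the features in alphabetic order.
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)
```

For instance, ``canonical_target_id("gfx908:xnack+:sramecc-")`` yields
``gfx908:sramecc-:xnack+``, while a target ID without features is returned
unchanged.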
 609
 610 .. _amdgpu-target-id-v2-v3:
 611
 612 Code Object V2 to V3 Target ID
 613 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 614
 615 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
 616 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
 617 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
 618 directive and the bundle entry ID. In those cases it has the following BNF
 619 syntax:
 620
 621 .. code::
 622
 623   <target-id> ::== <processor> ( "+" <target-feature> )*
 624
 625 Where a target feature is omitted if *Off* and present if *On* or *Any*.
 626
 627 .. note::
 628
 629   Code object V2 to V3 cannot represent *Any* and treats it the same as
 630   *On*.
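
As a rough illustration of the rule above, the following Python sketch builds
the code object V2 to V3 spelling from a processor name and a set of feature
settings. The function name ``v2_v3_target_id`` is hypothetical. Emitting the
features in alphabetic order and spelling them exactly as in
:ref:`amdgpu-target-features-table` are assumptions of this sketch, as the
grammar above does not state either.

```python
def v2_v3_target_id(processor: str, features: dict) -> str:
    """Build a '<processor>(+<target-feature>)*' string.

    features maps a target feature name to "on", "off" or "any".
    Hypothetical helper, not an LLVM API.
    """
    parts = [processor]
    for name in sorted(features):  # ordering is an assumption of this sketch
        setting = features[name]
        if setting not in ("on", "off", "any"):
            raise ValueError(f"unknown setting for {name}: {setting!r}")
        # A feature is omitted if Off and present if On or Any, so Any
        # cannot be distinguished from On in this form.
        if setting != "off":
            parts.append(name)
    return "+".join(parts)
```

With these assumptions, ``{"xnack": "any", "sramecc": "off"}`` for ``gfx908``
produces ``gfx908+xnack``, the same string as ``{"xnack": "on", "sramecc":
"off"}``.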
 631
 632 .. _amdgpu-embedding-bundled-objects:
 633
 634 Embedding Bundled Code Objects
 635 ------------------------------
 636
 637 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
 638 as described in `Clang Offload Bundler
 639 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
 640
 641 .. note::
 642
 643   The target ID syntax used for code object V2 to V3 for a bundle entry ID
 644   differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
 645
 646 .. _amdgpu-address-spaces:
 647
 648 Address Spaces
 649 --------------
 650
 651 The AMDGPU architecture supports a number of memory address spaces. The address
 652 space names use the OpenCL standard names, with some additions.
 653
 654 The AMDGPU address spaces correspond to target architecture specific LLVM
 655 address space numbers used in LLVM IR.
 656
 657 The AMDGPU address spaces are described in
 658 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
 659 supported for the ``amdgcn`` target.
 660
 661   .. table:: AMDGPU Address Spaces
 662      :name: amdgpu-address-spaces-table
 663
 664      ================================= =============== =========== ================ ======= ============================
 665      ..                                                                                     64-Bit Process Address Space
 666      --------------------------------- --------------- ----------- ---------------- ------------------------------------
 667      Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
 668                                        Space Number    Name        Name             Size
 669      ================================= =============== =========== ================ ======= ============================
 670      Generic                           0               flat        flat             64      0x0000000000000000
 671      Global                            1               global      global           64      0x0000000000000000
 672      Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
 673      Local                             3               group       LDS              32      0xFFFFFFFF
 674      Constant                          4               constant    *same as global* 64      0x0000000000000000
 675      Private                           5               private     scratch          32      0xFFFFFFFF
 676      Constant 32-bit                   6               *TODO*                               0x00000000
 677      Buffer Fat Pointer (experimental) 7               *TODO*
 678      ================================= =============== =========== ================ ======= ============================
 679
 680 **Generic**
 681   The generic address space is supported unless the *Target Properties* column
 682   of :ref:`amdgpu-processor-table` specifies *Does not support generic address
 683   space*.
 684
 685   The generic address space uses the hardware flat address support for two fixed
 686   ranges of virtual addresses (the private and local apertures), that are
 687   outside the range of addressable global memory, to map from a flat address to
 688   a private or local address. This uses FLAT instructions that can take a flat
 689   address and access global, private (scratch), and group (LDS) memory depending
 690   on whether the address is within one of the aperture ranges.
 691
 692   Flat access to scratch requires hardware aperture setup and setup in the
 693   kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
 694   access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
 695   setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
 696
 697   To convert between a private or group address space address (termed a segment
 698   address) and a flat address the base address of the corresponding aperture
 699   can be used. For GFX7-GFX8 these are available in the
 700   :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
 701   the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
 702   GFX9-GFX11 the aperture base addresses are directly available as inline
 703   constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
 704   In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
 705   aligned to 2^32 which makes it easier to convert from flat to segment or
 706   segment to flat.
 707
 708   A global address space address has the same value when used as a flat address
 709   so no conversion is needed.
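
The segment/flat conversion described above can be sketched in Python under
the stated 64-bit assumptions (aperture size of 2^32 bytes, base aligned to
2^32). The function names and the example base value are hypothetical; real
aperture bases come from the queue or the ``SRC_*_BASE`` registers. Mapping
the segment NULL value to the flat NULL value, using the values from
:ref:`amdgpu-address-spaces-table`, is an assumption of this sketch about how
a pointer cast would behave.

```python
APERTURE_SIZE = 1 << 32          # aperture size in 64-bit address mode
FLAT_NULL = 0x0000000000000000   # NULL value of the generic address space
SEGMENT_NULL = 0xFFFFFFFF        # NULL value of the local/private address spaces

# Hypothetical aperture base, chosen only so that it is aligned to 2^32.
EXAMPLE_PRIVATE_BASE = 0x2000 << 32


def segment_to_flat(segment_address: int, aperture_base: int) -> int:
    """Convert a 32-bit private or local address to a 64-bit flat address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    if segment_address == SEGMENT_NULL:
        return FLAT_NULL
    # The low 32 bits of the base are zero, so OR-ing equals adding.
    return aperture_base | segment_address


def is_in_aperture(flat_address: int, aperture_base: int) -> bool:
    """Check whether a flat address falls inside the given aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(flat_address: int, aperture_base: int) -> int:
    """Convert a flat address inside the aperture to a segment address."""
    if flat_address == FLAT_NULL:
        return SEGMENT_NULL
    if not is_in_aperture(flat_address, aperture_base):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)
```

Because the base is aligned to the aperture size, both directions reduce to
replacing or dropping the upper 32 bits, which is the simplification the text
above refers to.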
 710
 711 **Global and Constant**
 712   The global and constant address spaces both use global virtual addresses,
 713   which are the same virtual address space used by the CPU. However, some
 714   virtual addresses may only be accessible to the CPU, some only accessible
 715   by the GPU, and some by both.
 716
 717   Using the constant address space indicates that the data will not change
 718   during the execution of the kernel. This allows scalar read instructions to
 719   be used. As the constant address space can only be modified on the host
 720   side, a generic pointer loaded from the constant address space can safely be
 721   assumed to be a global pointer, since only the device global memory is visible
 722   and managed on the host side. The vector and scalar L1 caches are invalidated
 723   of volatile data before each kernel dispatch execution to allow constant
 724   memory to change values between kernel dispatches.
 725
 726 **Region**
 727   The region address space uses the hardware Global Data Store (GDS). All
 728   wavefronts executing on the same device will access the same memory for any
 729   given region address. However, the same region address accessed by wavefronts
 730   executing on different devices will access different memory. It offers higher
 731   performance than global memory. It is allocated by the runtime. The data
 732   store (DS) instructions can be used to access it.
 733
 734 **Local**
 735   The local address space uses the hardware Local Data Store (LDS) which is
 736   automatically allocated when the hardware creates the wavefronts of a
 737   work-group, and freed when all the wavefronts of a work-group have
 738   terminated. All wavefronts belonging to the same work-group will access the
 739   same memory for any given local address. However, the same local address
 740   accessed by wavefronts belonging to different work-groups will access
 741   different memory. It offers higher performance than global memory. The data store
 742   (DS) instructions can be used to access it.
 743
 744 **Private**
 745   The private address space uses the hardware scratch memory support which
 746   automatically allocates memory when it creates a wavefront and frees it when
 747   a wavefront terminates. The memory accessed by a lane of a wavefront for any
 748   given private address will be different to the memory accessed by another lane
 749   of the same or different wavefront for the same private address.
 750
 751   If a kernel dispatch uses scratch, then the hardware allocates memory from a
 752   pool of backing memory allocated by the runtime for each wavefront. The lanes
 753   of the wavefront access this using dword (4 byte) interleaving. The mapping
 754   used from private address to backing memory address is:
 755
 756     ``wavefront-scratch-base +
 757     ((private-address / 4) * wavefront-size * 4) +
 758     (wavefront-lane-id * 4) + (private-address % 4)``
 759
 760   If each lane of a wavefront accesses the same private address, the
 761   interleaving results in adjacent dwords being accessed and hence requires
 762   fewer cache lines to be fetched.
 763
 764   There are different ways that the wavefront scratch base address is
 765   determined by a wavefront (see
 766   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 767
 768   Scratch memory can be accessed in an interleaved manner using buffer
 769   instructions with the scratch buffer descriptor and per wavefront scratch
 770   offset, by the scratch instructions, or by flat instructions. Multi-dword
 771   access is not supported except by flat and scratch instructions in
 772   GFX9-GFX11.
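
The private address to backing memory mapping given above can be written out
as a small Python function. This is only a transcription of that formula for
illustration; the function and parameter names are chosen here and the base
address used in the example is made up.

```python
def scratch_backing_address(wavefront_scratch_base: int,
                            private_address: int,
                            wavefront_lane_id: int,
                            wavefront_size: int) -> int:
    """Map a lane's private address to its backing memory address.

    Implements the dword (4 byte) interleaving formula:
      base + ((addr / 4) * wavefront_size * 4) + (lane * 4) + (addr % 4)
    """
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4
            + wavefront_lane_id * 4
            + byte_in_dword)


# Lanes 0..3 of a 64-wide wavefront accessing private address 0 touch
# adjacent dwords (base address 0x1000 is an arbitrary example value).
adjacent = [scratch_backing_address(0x1000, 0, lane, 64) for lane in range(4)]
```

Here ``adjacent`` holds four addresses that are 4 bytes apart, which shows why
accesses to the same private address across lanes need fewer cache lines.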
 773
 774 **Constant 32-bit**
 775   *TODO*
 776
 777 **Buffer Fat Pointer**
 778   The buffer fat pointer is an experimental address space that is currently
 779   unsupported in the backend. It exposes a non-integral pointer that is in
 780   the future intended to support the modelling of 128-bit buffer descriptors
 781   plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
 782   *pointer*), allowing normal LLVM load/store/atomic operations to be used to
 783   model the buffer descriptors used heavily in graphics workloads targeting
 784   the backend.
 785
 786 .. _amdgpu-memory-scopes:
 787
 788 Memory Scopes
 789 -------------
 790
 791 This section describes the LLVM memory synchronization scopes supported by the
 792 AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
 793 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 794
 795 The memory model supported is based on the HSA memory model [HSA]_ which is
 796 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 797 relation is transitive over the synchronizes-with relation independent of scope
 798 and synchronizes-with allows the memory scope instances to be inclusive (see
 799 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 800
 801 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 802 inclusion and requires the memory scopes to exactly match. However, this
 803 is conservatively correct for OpenCL.
 804
 805   .. table:: AMDHSA LLVM Sync Scopes
 806      :name: amdgpu-amdhsa-llvm-sync-scopes-table
 807
 808      ======================= ===================================================
 809      LLVM Sync Scope         Description
 810      ======================= ===================================================
 811      *none*                  The default: ``system``.
 812
 813                              Synchronizes with, and participates in modification
 814                              and seq_cst total orderings with, other operations
 815                              (except image operations) for all address spaces
 816                              (except private, or generic that accesses private)
 817                              provided the other operation's sync scope is:
 818
 819                              - ``system``.
 820                              - ``agent`` and executed by a thread on the same
 821                                agent.
 822                              - ``workgroup`` and executed by a thread in the
 823                                same work-group.
 824                              - ``wavefront`` and executed by a thread in the
 825                                same wavefront.
 826
 827      ``agent``               Synchronizes with, and participates in modification
 828                              and seq_cst total orderings with, other operations
 829                              (except image operations) for all address spaces
 830                              (except private, or generic that accesses private)
 831                              provided the other operation's sync scope is:
 832
 833                              - ``system`` or ``agent`` and executed by a thread
 834                                on the same agent.
 835                              - ``workgroup`` and executed by a thread in the
 836                                same work-group.
 837                              - ``wavefront`` and executed by a thread in the
 838                                same wavefront.
 839
 840      ``workgroup``           Synchronizes with, and participates in modification
 841                              and seq_cst total orderings with, other operations
 842                              (except image operations) for all address spaces
 843                              (except private, or generic that accesses private)
 844                              provided the other operation's sync scope is:
 845
 846                              - ``system``, ``agent`` or ``workgroup`` and
 847                                executed by a thread in the same work-group.
 848                              - ``wavefront`` and executed by a thread in the
 849                                same wavefront.
 850
 851      ``wavefront``           Synchronizes with, and participates in modification
 852                              and seq_cst total orderings with, other operations
 853                              (except image operations) for all address spaces
 854                              (except private, or generic that accesses private)
 855                              provided the other operation's sync scope is:
 856
 857                              - ``system``, ``agent``, ``workgroup`` or
 858                                ``wavefront`` and executed by a thread in the
 859                                same wavefront.
 860
 861      ``singlethread``        Only synchronizes with and participates in
 862                              modification and seq_cst total orderings with,
 863                              other operations (except image operations) running
 864                              in the same thread for all address spaces (for
 865                              example, in signal handlers).
 866
 867      ``one-as``              Same as ``system`` but only synchronizes with other
 868                              operations within the same address space.
 869
 870      ``agent-one-as``        Same as ``agent`` but only synchronizes with other
 871                              operations within the same address space.
 872
 873      ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
 874                              other operations within the same address space.
 875
 876      ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
 877                              other operations within the same address space.
 878
 879      ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
 880                              other operations within the same address space.
 881      ======================= ===================================================
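
The scope inclusion rule in the table above can be modelled with a short
Python sketch. It captures only the scope condition: two operations can
synchronize when the executing threads are in the same instance of both
operations' scopes. It is a simplified model, not the backend's
implementation. The ``Thread`` tuple and the function names are hypothetical,
and address space restrictions, image operations and the ``-one-as`` variants
are not modelled.

```python
from collections import namedtuple

# Hypothetical description of where a thread runs. Work-group and wavefront
# identifiers are assumed to be unique only within their parent.
Thread = namedtuple("Thread", "agent workgroup wavefront lane")

# Number of leading Thread fields that must match for two threads to be in
# the same instance of each scope.
_SCOPE_DEPTH = {
    "system": 0,
    "agent": 1,
    "workgroup": 2,
    "wavefront": 3,
    "singlethread": 4,
}


def same_scope_instance(scope: str, a: Thread, b: Thread) -> bool:
    """Return True if threads a and b belong to the same instance of scope."""
    depth = _SCOPE_DEPTH[scope]
    return a[:depth] == b[:depth]


def scopes_are_inclusive(scope_a: str, thread_a: Thread,
                         scope_b: str, thread_b: Thread) -> bool:
    """Return True if the two operations' scopes include each other's thread."""
    return (same_scope_instance(scope_a, thread_a, thread_b)
            and same_scope_instance(scope_b, thread_a, thread_b))
```

For example, a ``system`` operation and an ``agent`` operation satisfy the
condition only when both threads run on the same agent, which matches the
second bullet of the *none* row.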
 882
 883 LLVM IR Intrinsics
 884 ------------------
 885
 886 The AMDGPU backend implements the following LLVM IR intrinsics.
 887
 888 *This section is WIP.*
 889
 890 .. TODO::
 891
 892    List AMDGPU intrinsics.
 893
 894 LLVM IR Attributes
 895 ------------------
 896
 897 The AMDGPU backend supports the following LLVM IR attributes.
 898
 899   .. table:: AMDGPU LLVM IR Attributes
 900      :name: amdgpu-llvm-ir-attributes-table
 901
 902      ======================================= ==========================================================
 903      LLVM Attribute                          Description
 904      ======================================= ==========================================================
 905      "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
 906                                              will be specified when the kernel is dispatched. Generated
 907                                              by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
 908                                              The implied default value is 1,1024.
 909
 910      "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
 911                                              argument block size for the implicit arguments. This
 912                                              varies by OS and language (for OpenCL see
 913                                              :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
 914      "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
 915                                              the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
 916      "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
 917                                              ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
 918      "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
 919                                              execution unit. Generated by the ``amdgpu_waves_per_eu``
 920                                              CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
 921                                              and the backend may not be able to satisfy the request. If
 922                                              the specified range is incompatible with the function's
 923                                              "amdgpu-flat-work-group-size" value, the occupancy bounds
 924                                              implied by the workgroup size take precedence.
 925
 926      "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
 927                                              mode register to be set on entry. Overrides the default for
 928                                              the calling convention.
 929      "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
 930                                              the mode register to be set on entry. Overrides the default
 931                                              for the calling convention.
 932
 933      "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
 934                                              llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
 935                                              attribute, or reached through a call site marked with this attribute,
 936                                              the value returned by the intrinsic is undefined. The backend can
 937                                              generally infer this during code generation, so typically there is no
 938                                              benefit to frontends marking functions with this.
 939
 940      "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
 941                                              llvm.amdgcn.workitem.id.y intrinsic.
 942
 943      "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
 944                                              llvm.amdgcn.workitem.id.z intrinsic.
 945
 946      "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
 947                                              llvm.amdgcn.workgroup.id.x intrinsic.
 948
 949      "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
 950                                              llvm.amdgcn.workgroup.id.y intrinsic.
 951
 952      "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
 953                                              llvm.amdgcn.workgroup.id.z intrinsic.
 954
 955      "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
 956                                              llvm.amdgcn.dispatch.ptr intrinsic.
 957
 958      "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
 959                                              llvm.amdgcn.implicitarg.ptr intrinsic.
 960
 961      "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
 962                                              llvm.amdgcn.dispatch.id intrinsic.
 963
 964      "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
 965                                              llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
 966                                              attributes, the queue pointer may be required in situations where the
 967                                              intrinsic call does not directly appear in the program. Some subtargets
 968                                              require the queue pointer to handle some addrspacecasts, as well
 969                                              as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
 970                                              llvm.debugtrap intrinsics.
 971
 972      "amdgpu-no-hostcall-ptr"                Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
 973                                              kernel argument that holds the pointer to the hostcall buffer. If this
 974                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
 975
 976      "amdgpu-no-heap-ptr"                    Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
 977                                              kernel argument that holds the pointer to an initialized memory buffer
 978                                              that conforms to the requirements of the malloc/free device library V1
 979                                              version implementation. If this attribute is absent, then the
 980                                              amdgpu-no-implicitarg-ptr is also removed.
 981
 982      "amdgpu-no-multigrid-sync-arg"          Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
 983                                              kernel argument that holds the multigrid synchronization pointer. If this
 984                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
 985
 986      "amdgpu-no-default-queue"               Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
 987                                              kernel argument that holds the default queue pointer. If this
 988                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
 989
 990      "amdgpu-no-completion-action"           Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
 991                                              kernel argument that holds the completion action pointer. If this
 992                                              attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
 993
 994      ======================================= ==========================================================
 995
 996 .. _amdgpu-elf-code-object:
 997
 998 ELF Code Object
 999 ===============
1000
1001 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1002 can be linked by ``lld`` to produce a standard ELF shared code object which can
1003 be loaded and executed on an AMDGPU target.
1004
1005 .. _amdgpu-elf-header:
1006
1007 Header
1008 ------
1009
1010 The AMDGPU backend uses the following ELF header:
1011
1012   .. table:: AMDGPU ELF Header
1013      :name: amdgpu-elf-header-table
1014
1015      ========================== ===============================
1016      Field                      Value
1017      ========================== ===============================
1018      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
1019      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
1020      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
1021                                 - ``ELFOSABI_AMDGPU_HSA``
1022                                 - ``ELFOSABI_AMDGPU_PAL``
1023                                 - ``ELFOSABI_AMDGPU_MESA3D``
1024      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1025                                 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1026                                 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1027                                 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1028                                 - ``ELFABIVERSION_AMDGPU_PAL``
1029                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
1030      ``e_type``                 - ``ET_REL``
1031                                 - ``ET_DYN``
1032      ``e_machine``              ``EM_AMDGPU``
1033      ``e_entry``                0
1034      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1035                                 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1036                                 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1037      ========================== ===============================
1038
1039 ..
1040
1041   .. table:: AMDGPU ELF Header Enumeration Values
1042      :name: amdgpu-elf-header-enumeration-values-table
1043
1044      =============================== =====
1045      Name                            Value
1046      =============================== =====
1047      ``EM_AMDGPU``                   224
1048      ``ELFOSABI_NONE``               0
1049      ``ELFOSABI_AMDGPU_HSA``         64
1050      ``ELFOSABI_AMDGPU_PAL``         65
1051      ``ELFOSABI_AMDGPU_MESA3D``      66
1052      ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1053      ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1054      ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1055      ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1056      ``ELFABIVERSION_AMDGPU_PAL``    0
1057      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1058      =============================== =====
1059
1060 ``e_ident[EI_CLASS]``
1061   The ELF class is:
1062
1063   * ``ELFCLASS32`` for ``r600`` architecture.
1064
1065   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1066     process address space applications.
1067
1068 ``e_ident[EI_DATA]``
1069   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1070
1071 ``e_ident[EI_OSABI]``
1072   One of the following AMDGPU target architecture specific OS ABIs
1073   (see :ref:`amdgpu-os`):
1074
1075   * ``ELFOSABI_NONE`` for *unknown* OS.
1076
1077   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1078
1079   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1080
1081   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
1082
1083 ``e_ident[EI_ABIVERSION]``
1084   The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1085   object conforms:
1086
1087   * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1088     runtime ABI for code object V2. Specify using the Clang option
1089     ``-mcode-object-version=2``.
1090
1091   * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1092     runtime ABI for code object V3. Specify using the Clang option
1093     ``-mcode-object-version=3``.
1094
1095   * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1096     runtime ABI for code object V4. Specify using the Clang option
1097     ``-mcode-object-version=4``. This is the default code object
1098     version if not specified.
1099
1100   * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1101     runtime ABI for code object V5. Specify using the Clang option
1102     ``-mcode-object-version=5``.
1103
1104   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1105     runtime ABI.
1106
1107   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1108     3D runtime ABI.
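
       As a sketch of how these identification bytes can be interpreted, the
       following C fragment maps ``e_ident`` to an HSA code object version. It
       is an illustration only and not code from LLVM or the HSA runtime: the
       helper name ``amdgpu_hsa_code_object_version`` and the ``*_INDEX``
       constants are hypothetical, the byte indices 7 and 8 are assumed to be
       the ``EI_OSABI`` and ``EI_ABIVERSION`` positions of the generic ELF
       specification, and the enumerator values are copied from
       :ref:`amdgpu-elf-header-enumeration-values-table`.

       .. code:: c

         #include <stdint.h>

         /* Values copied from the enumeration values table above. */
         enum {
           ELFOSABI_AMDGPU_HSA = 64,
           ELFABIVERSION_AMDGPU_HSA_V2 = 0,
           ELFABIVERSION_AMDGPU_HSA_V3 = 1,
           ELFABIVERSION_AMDGPU_HSA_V4 = 2,
           ELFABIVERSION_AMDGPU_HSA_V5 = 3
         };

         /* Assumed positions of the two bytes inside e_ident (generic ELF). */
         enum { AMDGPU_EI_OSABI_INDEX = 7, AMDGPU_EI_ABIVERSION_INDEX = 8 };

         /* Hypothetical helper: returns the code object version (2 to 5) for
            an amdhsa code object, or 0 if the OS ABI is not HSA or the ABI
            version is not one of the values listed above. */
         int amdgpu_hsa_code_object_version(const unsigned char e_ident[16]) {
           if (e_ident[AMDGPU_EI_OSABI_INDEX] != ELFOSABI_AMDGPU_HSA)
             return 0;
           switch (e_ident[AMDGPU_EI_ABIVERSION_INDEX]) {
           case ELFABIVERSION_AMDGPU_HSA_V2: return 2;
           case ELFABIVERSION_AMDGPU_HSA_V3: return 3;
           case ELFABIVERSION_AMDGPU_HSA_V4: return 4;
           case ELFABIVERSION_AMDGPU_HSA_V5: return 5;
           default:                          return 0;
           }
         }

       The OS ABI check comes first because ``ELFABIVERSION_AMDGPU_PAL`` and
       ``ELFABIVERSION_AMDGPU_MESA3D`` share the value 0 with
       ``ELFABIVERSION_AMDGPU_HSA_V2``, so the ABI version byte is only
       meaningful once the OS ABI is known.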
1109
1110 ``e_type``
1111   Can be one of the following values:
1112
1113
1114   ``ET_REL``
1115     The type produced by the AMDGPU backend compiler as it is a relocatable code
1116     object.
1117
1118   ``ET_DYN``
1119     The type produced by the linker as it is a shared code object.
1120
1121   The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1122
1123 ``e_machine``
1124   The value ``EM_AMDGPU`` is used for the machine for all processors supported
1125   by the ``r600`` and ``amdgcn`` architectures (see
1126   :ref:`amdgpu-processor-table`). The specific processor is specified in the
1127   ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1128   :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1129   ``e_flags`` for code object V3 and above (see
1130   :ref:`amdgpu-elf-header-e_flags-table-v3` and
1131   :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1132
1133 ``e_entry``
1134   The entry point is 0 as the entry points for individual kernels must be
1135   selected in order to invoke them through AQL packets.
1136
1137 ``e_flags``
1138   The AMDGPU backend uses the following ELF header flags:
1139
1140   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1141      :name: amdgpu-elf-header-e_flags-v2-table
1142
1143      ===================================== ===== =============================
1144      Name                                  Value Description
1145      ===================================== ===== =============================
1146      ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1147                                                  target feature is
1148                                                  enabled for all code
1149                                                  contained in the code object.
1150                                                  If the processor
1151                                                  does not support the
1152                                                  ``xnack`` target
1153                                                  feature then must
1154                                                  be 0.
1155                                                  See
1156                                                  :ref:`amdgpu-target-features`.
1157      ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1158                                                  handler is enabled for all
1159                                                  code contained in the code
1160                                                  object. If the processor
1161                                                  does not support a trap
1162                                                  handler then must be 0.
1163                                                  See
1164                                                  :ref:`amdgpu-target-features`.
1165      ===================================== ===== =============================
1166
1167   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1168      :name: amdgpu-elf-header-e_flags-table-v3
1169
1170      ================================= ===== =============================
1171      Name                              Value Description
1172      ================================= ===== =============================
1173      ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1174                                              mask for
1175                                              ``EF_AMDGPU_MACH_xxx`` values
1176                                              defined in
1177                                              :ref:`amdgpu-ef-amdgpu-mach-table`.
1178      ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1179                                              target feature is
1180                                              enabled for all code
1181                                              contained in the code object.
1182                                              If the processor
1183                                              does not support the
1184                                              ``xnack`` target
1185                                              feature then must
1186                                              be 0.
1187                                              See
1188                                              :ref:`amdgpu-target-features`.
1189      ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1190                                              target feature is
1191                                              enabled for all code
1192                                              contained in the code object.
1193                                              If the processor
1194                                              does not support the
1195                                              ``sramecc`` target
1196                                              feature then must
1197                                              be 0.
1198                                              See
1199                                              :ref:`amdgpu-target-features`.
1200      ================================= ===== =============================
1201
1202   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1203      :name: amdgpu-elf-header-e_flags-table-v4-onwards
1204
1205      ============================================ ===== ===================================
1206      Name                                         Value Description
1207      ============================================ ===== ===================================
1208      ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1209                                                         mask for
1210                                                         ``EF_AMDGPU_MACH_xxx`` values
1211                                                         defined in
1212                                                         :ref:`amdgpu-ef-amdgpu-mach-table`.
1213      ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1214                                                         ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1215                                                         values.
1216      ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1217      ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1218      ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1219      ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1220      ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1221                                                         ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1222                                                         values.
1223      ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1224      ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
1225      ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1226      ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1227      ============================================ ===== ===================================
1228
1229   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1230      :name: amdgpu-ef-amdgpu-mach-table
1231
1232      ==================================== ========== =============================
1233      Name                                 Value      Description (see
1234                                                      :ref:`amdgpu-processor-table`)
1235      ==================================== ========== =============================
1236      ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1237      ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1238      ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1239      ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1240      ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1241      ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1242      ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1243      ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1244      ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1245      ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1246      ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1247      ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1248      ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1249      ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1250      ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1251      ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1252      ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1253      *reserved*                           0x011 -    Reserved for ``r600``
1254                                           0x01f      architecture processors.
1255      ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1256      ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1257      ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1258      ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1259      ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1260      ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1261      ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1262      *reserved*                           0x027      Reserved.
1263      ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1264      ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1265      ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1266      ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1267      ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1268      ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1269      ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1270      ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1271      ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1272      ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1273      ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1274      ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1275      ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1276      ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1277      ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1278      ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1279      ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1280      ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1281      ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1282      ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1283      ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1284      ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1285      ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1286      ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1287      ``EF_AMDGPU_MACH_AMDGCN_GFX940``     0x040      ``gfx940``
1288      ``EF_AMDGPU_MACH_AMDGCN_GFX1100``    0x041      ``gfx1100``
1289      ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1290      *reserved*                           0x043      Reserved.
1291      ``EF_AMDGPU_MACH_AMDGCN_GFX1103``    0x044      ``gfx1103``
1292      ``EF_AMDGPU_MACH_AMDGCN_GFX1036``    0x045      ``gfx1036``
1293      ``EF_AMDGPU_MACH_AMDGCN_GFX1101``    0x046      ``gfx1101``
1294      ``EF_AMDGPU_MACH_AMDGCN_GFX1102``    0x047      ``gfx1102``
1295      ==================================== ========== =============================
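
       The code object V4 and later ``e_flags`` layout can be decoded with
       plain mask operations. The following is a minimal sketch, not the code
       used by LLVM or any runtime loader: the ``amdgpu_e_flags_*`` helper
       names are hypothetical, and the enumerator values are copied from
       :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`.

       .. code:: c

         #include <stdint.h>

         /* Values copied from the code object V4 e_flags table above. */
         enum {
           EF_AMDGPU_MACH = 0x0ff,
           EF_AMDGPU_FEATURE_XNACK_V4 = 0x300,
           EF_AMDGPU_FEATURE_XNACK_ANY_V4 = 0x100,
           EF_AMDGPU_FEATURE_XNACK_OFF_V4 = 0x200,
           EF_AMDGPU_FEATURE_XNACK_ON_V4 = 0x300,
           EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xc00,
           EF_AMDGPU_FEATURE_SRAMECC_ANY_V4 = 0x400,
           EF_AMDGPU_FEATURE_SRAMECC_OFF_V4 = 0x800,
           EF_AMDGPU_FEATURE_SRAMECC_ON_V4 = 0xc00
         };

         /* Hypothetical helper: extract the EF_AMDGPU_MACH_xxx value. */
         uint32_t amdgpu_e_flags_mach(uint32_t e_flags) {
           return e_flags & EF_AMDGPU_MACH;
         }

         /* Hypothetical helper: describe the XNACK selection field. */
         const char *amdgpu_e_flags_xnack_v4(uint32_t e_flags) {
           switch (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) {
           case EF_AMDGPU_FEATURE_XNACK_ANY_V4: return "any";
           case EF_AMDGPU_FEATURE_XNACK_OFF_V4: return "off";
           case EF_AMDGPU_FEATURE_XNACK_ON_V4:  return "on";
           default:                             return "unsupported";
           }
         }

         /* Hypothetical helper: describe the SRAMECC selection field. */
         const char *amdgpu_e_flags_sramecc_v4(uint32_t e_flags) {
           switch (e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) {
           case EF_AMDGPU_FEATURE_SRAMECC_ANY_V4: return "any";
           case EF_AMDGPU_FEATURE_SRAMECC_OFF_V4: return "off";
           case EF_AMDGPU_FEATURE_SRAMECC_ON_V4:  return "on";
           default:                               return "unsupported";
           }
         }

       For example, an ``e_flags`` value of ``0xb3f`` decodes to processor
       value ``0x03f`` (``gfx90a``), XNACK enabled, and SRAMECC disabled.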
1296
1297 Sections
1298 --------
1299
1300 An AMDGPU target ELF code object has the standard ELF sections which include:
1301
1302   .. table:: AMDGPU ELF Sections
1303      :name: amdgpu-elf-sections-table
1304
1305      ================== ================ =================================
1306      Name               Type             Attributes
1307      ================== ================ =================================
1308      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1309      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1310      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1311      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
1312      ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
1313      ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
1314      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1315      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1316      ``.note``          ``SHT_NOTE``     *none*
1317      ``.rela``\ *name*  ``SHT_RELA``     *none*
1318      ``.rela.dyn``      ``SHT_RELA``     *none*
1319      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1320      ``.shstrtab``      ``SHT_STRTAB``   *none*
1321      ``.strtab``        ``SHT_STRTAB``   *none*
1322      ``.symtab``        ``SHT_SYMTAB``   *none*
1323      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1324      ================== ================ =================================
1325
1326 These sections have their standard meanings (see [ELF]_) and are only generated
1327 if needed.
1328
1329 ``.debug_``\ *\**
1330   The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1331   information on the DWARF produced by the AMDGPU backend.
1332
1333 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1334   The standard sections used by a dynamic loader.
1335
1336 ``.note``
1337   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1338   backend.
1339
1340 ``.rela``\ *name*, ``.rela.dyn``
1341   For relocatable code objects, *name* is the name of the section to which the
1342   relocation records apply. For example, ``.rela.text`` is the section name for
1343   relocation records associated with the ``.text`` section.
1344
1345   For linked shared code objects, ``.rela.dyn`` contains all the relocation
1346   records from each of the relocatable code object's ``.rela``\ *name* sections.
1347
1348   See :ref:`amdgpu-relocation-records` for the relocation records supported by
1349   the AMDGPU backend.
1350
1351 ``.text``
1352   The executable machine code for the kernels and functions they call. Generated
1353   as position independent code. See :ref:`amdgpu-code-conventions` for
1354   information on conventions used in the ISA generation.
1355
1356 .. _amdgpu-note-records:
1357
1358 Note Records
1359 ------------
1360
1361 The AMDGPU backend code object contains ELF note records in the ``.note``
1362 section. The set of generated notes and their semantics depend on the code
1363 object version; see :ref:`amdgpu-note-records-v2` and
1364 :ref:`amdgpu-note-records-v3-onwards`.
1365
1366 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1367 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1368 byte aligned. In addition, minimal zero-byte padding must be generated to
1369 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1370 field of the ``.note`` section must be at least 4 to indicate at least 4 byte
1371 alignment.
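
     The padding rules can be illustrated with a small size calculation. This
     is a sketch under stated assumptions rather than the backend's
     implementation: the ``amdgpu_note_*`` function names are hypothetical,
     and the 12 byte header (the 32-bit ``namesz``, ``descsz``, and ``type``
     words, with ``namesz`` counting the terminating NUL) is assumed from the
     generic ELF note format.

     .. code:: c

       #include <stdint.h>

       /* Round n up to the next multiple of 4 (the minimal zero padding). */
       uint32_t amdgpu_note_align4(uint32_t n) {
         return (n + 3u) & ~3u;
       }

       /* Offset of the desc field from the start of the note record:
          three 32-bit header words followed by the padded name field. */
       uint32_t amdgpu_note_desc_offset(uint32_t namesz) {
         return 12u + amdgpu_note_align4(namesz);
       }

       /* Total size of one note record, including padding after desc. */
       uint32_t amdgpu_note_record_size(uint32_t namesz, uint32_t descsz) {
         return amdgpu_note_desc_offset(namesz) + amdgpu_note_align4(descsz);
       }

     For a note named "AMD" (``namesz`` of 4 including the NUL), the ``desc``
     field starts at offset 16 with no name padding, and a 5 byte ``desc`` is
     padded to 8 bytes for a record size of 24 bytes.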
1372
1373 .. _amdgpu-note-records-v2:
1374
1375 Code Object V2 Note Records
1376 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1377
1378 .. warning::
1379   Code object V2 is not the default code object version emitted by
1380   this version of LLVM.
1381
1382 The AMDGPU backend code object uses the following ELF note record in the
1383 ``.note`` section when compiling for code object V2.
1384
1385 The note record vendor field is "AMD".
1386
1387 Additional note records may be present, but any which are not documented here
1388 are deprecated and should not be used.
1389
1390   .. table:: AMDGPU Code Object V2 ELF Note Records
1391      :name: amdgpu-elf-note-records-v2-table
1392
1393      ===== ===================================== ======================================
1394      Name  Type                                  Description
1395      ===== ===================================== ======================================
1396      "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1397      "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1398                                                  Finalizer and not the LLVM compiler.
1399      "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1400      "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1401                                                  YAML [YAML]_ textual format.
1402      "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1403      ===== ===================================== ======================================
1404
1405 ..
1406
1407   .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1408      :name: amdgpu-elf-note-record-enumeration-values-v2-table
1409
1410      ===================================== =====
1411      Name                                  Value
1412      ===================================== =====
1413      ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1414      ``NT_AMD_HSA_HSAIL``                  2
1415      ``NT_AMD_HSA_ISA_VERSION``            3
1416      *reserved*                            4-9
1417      ``NT_AMD_HSA_METADATA``               10
1418      ``NT_AMD_HSA_ISA_NAME``               11
1419      ===================================== =====
1420
1421 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1422   Specifies the code object version number. The description field has the
1423   following layout:
1424
1425   .. code:: c
1426
1427     struct amdgpu_hsa_note_code_object_version_s {
1428       uint32_t major_version;
1429       uint32_t minor_version;
1430     };
1431
1432   The ``major_version`` has a value less than or equal to 2.
1433
1434 ``NT_AMD_HSA_HSAIL``
1435   Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1436   field has the following layout:
1437
1438   .. code:: c
1439
1440     struct amdgpu_hsa_note_hsail_s {
1441       uint32_t hsail_major_version;
1442       uint32_t hsail_minor_version;
1443       uint8_t profile;
1444       uint8_t machine_model;
1445       uint8_t default_float_round;
1446     };
1447
1448 ``NT_AMD_HSA_ISA_VERSION``
1449   Specifies the target ISA version. The description field has the following layout:
1450
1451   .. code:: c
1452
1453     struct amdgpu_hsa_note_isa_s {
1454       uint16_t vendor_name_size;
1455       uint16_t architecture_name_size;
1456       uint32_t major;
1457       uint32_t minor;
1458       uint32_t stepping;
1459       char vendor_and_architecture_name[1];
1460     };
1461
1462   ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
1463   vendor and architecture names respectively, including the NUL character.
1464
1465   ``vendor_and_architecture_name`` contains the NUL terminated string for the
1466   vendor, immediately followed by the NUL terminated string for the
1467   architecture.
1468
1469   This note record is used by the HSA runtime loader.
1470
1471   Code object V2 only supports a limited number of processors and has fixed
1472   settings for target features. See
1473   :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1474   processors and the corresponding target ID. In the table the note record ISA
1475   name is a concatenation of the vendor name, architecture name, major, minor,
1476   and stepping separated by a ":".
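
  As an illustration, the following minimal sketch builds that ISA name from
  the description field of a ``NT_AMD_HSA_ISA_VERSION`` note record. The
  helper ``format_isa_name`` and the ``isa_note_header`` struct are
  hypothetical names, not part of any LLVM or HSA runtime API, and the sketch
  assumes the host byte order matches that of the code object.

  .. code:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Fixed-size part of the description field. The two NUL terminated
       names follow it directly. (Hypothetical helper type.) */
    struct isa_note_header {
      uint16_t vendor_name_size;       /* includes the NUL character */
      uint16_t architecture_name_size; /* includes the NUL character */
      uint32_t major;
      uint32_t minor;
      uint32_t stepping;
    };

    /* Writes "vendor:architecture:major:minor:stepping" into out.
       Returns 0 on success, or -1 if the description is malformed or out
       is too small. */
    static int format_isa_name(const uint8_t *desc, size_t desc_size,
                               char *out, size_t out_size) {
      struct isa_note_header h;
      if (desc_size < sizeof h)
        return -1;
      memcpy(&h, desc, sizeof h); /* copy to avoid unaligned access */
      if (h.vendor_name_size == 0 || h.architecture_name_size == 0)
        return -1;
      size_t names = (size_t)h.vendor_name_size + h.architecture_name_size;
      if (desc_size - sizeof h < names)
        return -1;
      const char *vendor = (const char *)desc + sizeof h;
      const char *arch = vendor + h.vendor_name_size;
      /* Both names must end with the NUL counted in their sizes. */
      if (vendor[h.vendor_name_size - 1] != '\0' ||
          arch[h.architecture_name_size - 1] != '\0')
        return -1;
      int n = snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch,
                       (unsigned)h.major, (unsigned)h.minor,
                       (unsigned)h.stepping);
      return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
    }

  The resulting string can then be looked up in the first column of the
  supported processors table.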
1477
1478   The target ID column shows the processor name and fixed target features used
1479   by the LLVM compiler. The LLVM compiler does not generate a
1480   ``NT_AMD_HSA_HSAIL`` note record.
1481
1482   A code object generated by the Finalizer also uses code object V2 and always
  generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
  ``sramecc`` target feature are as shown in
1485   :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1486   target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1487   bit.
1488
1489 ``NT_AMD_HSA_ISA_NAME``
1490   Specifies the target ISA name as a non-NUL terminated string.
1491
1492   This note record is not used by the HSA runtime loader.
1493
  See the ``NT_AMD_HSA_ISA_VERSION`` note record for a description of code
  object V2's limited support of processors and fixed settings for target
  features.
1496
1497   See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1498   from the string to the corresponding target ID. If the ``xnack`` target
1499   feature is supported and enabled, the string produced by the LLVM compiler
  may have ``+xnack`` appended. The Finalizer did not do the appending and
  instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1502
1503 ``NT_AMD_HSA_METADATA``
1504   Specifies extensible metadata associated with the code objects executed on HSA
1505   [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1506   target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1507   :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1508   metadata string.
1509
1510   .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1511      :name: amdgpu-elf-note-record-supported_processors-v2-table
1512
1513      ===================== ==========================
1514      Note Record ISA Name  Target ID
1515      ===================== ==========================
1516      ``AMD:AMDGPU:6:0:0``  ``gfx600``
1517      ``AMD:AMDGPU:6:0:1``  ``gfx601``
1518      ``AMD:AMDGPU:6:0:2``  ``gfx602``
1519      ``AMD:AMDGPU:7:0:0``  ``gfx700``
1520      ``AMD:AMDGPU:7:0:1``  ``gfx701``
1521      ``AMD:AMDGPU:7:0:2``  ``gfx702``
1522      ``AMD:AMDGPU:7:0:3``  ``gfx703``
1523      ``AMD:AMDGPU:7:0:4``  ``gfx704``
1524      ``AMD:AMDGPU:7:0:5``  ``gfx705``
1525      ``AMD:AMDGPU:8:0:0``  ``gfx802``
1526      ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1527      ``AMD:AMDGPU:8:0:2``  ``gfx802``
1528      ``AMD:AMDGPU:8:0:3``  ``gfx803``
1529      ``AMD:AMDGPU:8:0:4``  ``gfx803``
1530      ``AMD:AMDGPU:8:0:5``  ``gfx805``
1531      ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1532      ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1533      ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1534      ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1535      ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1536      ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1537      ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1538      ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1539      ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1540      ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1541      ===================== ==========================
1542
1543 .. _amdgpu-note-records-v3-onwards:
1544
1545 Code Object V3 and Above Note Records
1546 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1547
1548 The AMDGPU backend code object uses the following ELF note record in the
1549 ``.note`` section when compiling for code object V3 and above.
1550
1551 The note record vendor field is "AMDGPU".
1552
1553 Additional note records may be present, but any which are not documented here
1554 are deprecated and should not be used.
1555
1556   .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1557      :name: amdgpu-elf-note-records-table-v3-onwards
1558
1559      ======== ============================== ======================================
1560      Name     Type                           Description
1561      ======== ============================== ======================================
1562      "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1563                                              binary format.
1564      ======== ============================== ======================================
1565
1566 ..
1567
1568   .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1569      :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1570
1571      ============================== =====
1572      Name                           Value
1573      ============================== =====
1574      *reserved*                     0-31
1575      ``NT_AMDGPU_METADATA``         32
1576      ============================== =====
1577
1578 ``NT_AMDGPU_METADATA``
1579   Specifies extensible metadata associated with an AMDGPU code object. It is
1580   encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1581   :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1582   :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
1583   :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
1584   ``amdhsa`` OS.
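
  As a rough sketch, a tool could locate this record by walking the contents
  of the ``.note`` section. The code assumes the standard ELF note layout
  (three 32-bit words ``namesz``, ``descsz`` and ``type``, followed by the
  name and the description, each padded to a 4-byte boundary) and a host byte
  order that matches the code object. The function name
  ``find_amdgpu_metadata`` is hypothetical.

  .. code:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define NT_AMDGPU_METADATA 32u

    /* Rounds n up to the next multiple of 4. */
    static size_t align4(size_t n) { return (n + 3u) & ~(size_t)3u; }

    /* Returns a pointer to the Message Pack description of the first
       "AMDGPU"/NT_AMDGPU_METADATA note and stores its size in *desc_size.
       Returns NULL if there is no such note or the data is truncated. */
    static const uint8_t *find_amdgpu_metadata(const uint8_t *notes,
                                               size_t size,
                                               size_t *desc_size) {
      size_t pos = 0;
      while (size - pos >= 12) {
        uint32_t namesz, descsz, type;
        memcpy(&namesz, notes + pos, 4);
        memcpy(&descsz, notes + pos + 4, 4);
        memcpy(&type, notes + pos + 8, 4);
        pos += 12;

        size_t name_len = align4(namesz);
        if (name_len < namesz || size - pos < name_len)
          return NULL;
        const uint8_t *name = notes + pos;
        pos += name_len;

        if (size - pos < descsz)
          return NULL;
        /* namesz counts the terminating NUL of "AMDGPU". */
        if (type == NT_AMDGPU_METADATA && namesz == 7 &&
            memcmp(name, "AMDGPU", 7) == 0) {
          *desc_size = descsz;
          return notes + pos;
        }

        size_t desc_len = align4(descsz);
        if (desc_len < descsz || size - pos < desc_len)
          return NULL;
        pos += desc_len;
      }
      return NULL;
    }

  The returned bytes would then be handed to a Message Pack decoder, which is
  outside the scope of this sketch.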
1585
1586 .. _amdgpu-symbols:
1587
1588 Symbols
1589 -------
1590
1591 Symbols include the following:
1592
1593   .. table:: AMDGPU ELF Symbols
1594      :name: amdgpu-elf-symbols-table
1595
1596      ===================== ================== ================ ==================
1597      Name                  Type               Section          Description
1598      ===================== ================== ================ ==================
1599      *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1600                                               - ``.rodata``
1601                                               - ``.bss``
1602      *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1603      *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1604      *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1605      ===================== ================== ================ ==================
1606
1607 Global variable
1608   Global variables both used and defined by the compilation unit.
1609
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
1612
  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1614   will resolve relocations using the definition provided by another code object
1615   or explicitly defined by the runtime.
1616
1617   If the symbol resides in local/group memory (LDS) then its section is the
1618   special processor specific section name ``SHN_AMDGPU_LDS``, and the
1619   ``st_value`` field describes alignment requirements as it does for common
1620   symbols.
1621
1622   .. TODO::
1623
1624      Add description of linked shared object symbols. Seems undefined symbols
1625      are marked as STT_NOTYPE.
1626
1627 Kernel descriptor
1628   Every HSA kernel has an associated kernel descriptor. It is the address of the
1629   kernel descriptor that is used in the AQL dispatch packet used to invoke the
1630   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1631   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1632
1633 Kernel entry point
1634   Every HSA kernel also has a symbol for its machine code entry point.
1635
1636 .. _amdgpu-relocation-records:
1637
1638 Relocation Records
1639 ------------------
1640
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1642 relocatable fields are:
1643
1644 ``word32``
1645   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1646   alignment. These values use the same byte order as other word values in the
1647   AMDGPU architecture.
1648
1649 ``word64``
1650   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1651   alignment. These values use the same byte order as other word values in the
1652   AMDGPU architecture.
1653
The following notations are used for specifying relocation calculations:
1655
1656 **A**
1657   Represents the addend used to compute the value of the relocatable field.
1658
1659 **G**
1660   Represents the offset into the global offset table at which the relocation
1661   entry's symbol will reside during execution.
1662
1663 **GOT**
1664   Represents the address of the global offset table.
1665
1666 **P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1668   of the storage unit being relocated (computed using ``r_offset``).
1669
1670 **S**
1671   Represents the value of the symbol whose index resides in the relocation
1672   entry. Relocations not using this must specify a symbol index of
1673   ``STN_UNDEF``.
1674
1675 **B**
1676   Represents the base address of a loaded executable or shared object which is
1677   the difference between the ELF address and the actual load address.
1678   Relocations using this are only valid in executable or shared objects.
1679
1680 The following relocation types are supported:
1681
1682   .. table:: AMDGPU ELF Relocation Records
1683      :name: amdgpu-elf-relocation-records-table
1684
1685      ========================== ======= =====  ==========  ==============================
1686      Relocation Type            Kind    Value  Field       Calculation
1687      ========================== ======= =====  ==========  ==============================
1688      ``R_AMDGPU_NONE``                  0      *none*      *none*
1689      ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1690                                 Dynamic
1691      ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1692                                 Dynamic
1693      ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1694                                 Dynamic
1695      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1696      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1697      ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1698                                 Dynamic
1699      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1700      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1701      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1702      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1703      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1704      *reserved*                         12
1705      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1706      ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1707      ========================== ======= =====  ==========  ==============================
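
The calculations in the table can be sketched in C as follows. The function
names are hypothetical, ``s``, ``a`` and ``p`` stand for **S**, **A** and
**P**, and the sketch assumes two's complement 64-bit arithmetic. The
``G + GOT + A - P`` forms have the same shape with ``G + GOT`` substituted
for ``S``.

.. code:: c

  #include <stdint.h>

  /* R_AMDGPU_ABS32_LO: (S + A) & 0xFFFFFFFF */
  static uint32_t reloc_abs32_lo(uint64_t s, int64_t a) {
    return (uint32_t)((s + (uint64_t)a) & 0xFFFFFFFFu);
  }

  /* R_AMDGPU_ABS32_HI: (S + A) >> 32 */
  static uint32_t reloc_abs32_hi(uint64_t s, int64_t a) {
    return (uint32_t)((s + (uint64_t)a) >> 32);
  }

  /* R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF */
  static uint32_t reloc_rel32_lo(uint64_t s, int64_t a, uint64_t p) {
    return (uint32_t)((s + (uint64_t)a - p) & 0xFFFFFFFFu);
  }

  /* R_AMDGPU_REL32_HI: (S + A - P) >> 32, that is the upper word of the
     64-bit two's complement difference. */
  static uint32_t reloc_rel32_hi(uint64_t s, int64_t a, uint64_t p) {
    return (uint32_t)((s + (uint64_t)a - p) >> 32);
  }

  /* R_AMDGPU_REL16: ((S + A - P) - 4) / 4.
     Stores the result in *out and returns 0, or returns -1 if the value
     does not fit in a signed 16-bit field. */
  static int reloc_rel16(uint64_t s, int64_t a, uint64_t p, int16_t *out) {
    int64_t d = (int64_t)(s + (uint64_t)a - p);
    int64_t v = (d - 4) / 4;
    if (v < INT16_MIN || v > INT16_MAX)
      return -1;
    *out = (int16_t)v;
    return 0;
  }

For example, with ``S = 0x1000``, ``A = 0`` and ``P = 0x0FF0``, the
``R_AMDGPU_REL16`` calculation gives ``(0x10 - 4) / 4 = 3``.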
1708
1709 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1710 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1711
1712 There is no current OS loader support for 32-bit programs and so
1713 ``R_AMDGPU_ABS32`` is not used.
1714
1715 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1716
1717 Loaded Code Object Path Uniform Resource Identifier (URI)
1718 ---------------------------------------------------------
1719
1720 The AMD GPU code object loader represents the path of the ELF shared object from
1721 which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded and relocated form of the
ELF shared object.  Multiple code objects may be loaded at different memory
addresses in the same process from the same ELF shared object.
1725
1726 The loaded code object path URI syntax is defined by the following BNF syntax:
1727
1728 .. code::
1729
1730   code_object_uri ::== file_uri | memory_uri
1731   file_uri        ::== "file://" file_path [ range_specifier ]
1732   memory_uri      ::== "memory://" process_id range_specifier
1733   range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1734   file_path       ::== URI_ENCODED_OS_FILE_PATH
1735   process_id      ::== DECIMAL_NUMBER
1736   number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1737
1738 **number**
1739   Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1740   and octal values by "0".
1741
1742 **file_path**
1743   Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1744   every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
1746   the path are separated by "/".
1747
1748 **offset**
1749   Is a 0-based byte offset to the start of the code object.  For a file URI, it
1750   is from the start of the file specified by the ``file_path``, and if omitted
1751   defaults to 0. For a memory URI, it is the memory address and is required.
1752
1753 **size**
1754   Is the number of bytes in the code object.  For a file URI, if omitted it
1755   defaults to the size of the file.  It is required for a memory URI.
1756
1757 **process_id**
1758   Is the identity of the process owning the memory.  For Linux it is the C
1759   unsigned integral decimal literal for the process ID (PID).
1760
1761 For example:
1762
1763 .. code::
1764
1765   file:///dir1/dir2/file1
1766   file:///dir3/dir4/file2#offset=0x2000&size=3000
1767   memory://1234#offset=0x20000&size=3000
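
The ``file_path`` encoding rule can be sketched in C as follows. The function
name ``uri_encode_path`` is hypothetical, and the input is assumed to be a
NUL terminated UTF-8 string whose directories are already separated by "/".

.. code:: c

  #include <stddef.h>
  #include <string.h>

  /* Encodes path into out using the rule above: bytes matching
     [a-zA-Z0-9/_.~-] are copied, every other byte becomes "%" followed by
     two uppercase hexadecimal digits.
     Returns 0 on success, or -1 if out is too small. */
  static int uri_encode_path(const char *path, char *out, size_t out_size) {
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;
    if (out_size == 0)
      return -1;
    for (const unsigned char *p = (const unsigned char *)path; *p; ++p) {
      int keep = (*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') ||
                 (*p >= '0' && *p <= '9') || strchr("/_.~-", *p) != NULL;
      size_t need = keep ? 1 : 3;
      if (out_size - n <= need) /* keep room for the terminating NUL */
        return -1;
      if (keep) {
        out[n++] = (char)*p;
      } else {
        out[n++] = '%';
        out[n++] = hex[*p >> 4];
        out[n++] = hex[*p & 0xF];
      }
    }
    out[n] = '\0';
    return 0;
  }

With this sketch, the path ``/dir1/my file.co`` would be encoded as
``/dir1/my%20file.co`` before being prefixed with ``file://``.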
1768
1769 .. _amdgpu-dwarf-debug-information:
1770
1771 DWARF Debug Information
1772 =======================
1773
1774 .. warning::
1775
1776    This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1777    is not currently fully implemented and is subject to change.
1778
1779 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1780 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
1781 object executable code and data to the source language constructs. It can be
1782 used by tools such as debuggers and profilers. It uses features defined in
1783 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1784 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1785
1786 This section defines the AMDGPU target architecture specific DWARF mappings.
1787
1788 .. _amdgpu-dwarf-register-identifier:
1789
1790 Register Identifier
1791 -------------------
1792
1793 This section defines the AMDGPU target architecture register numbers used in
1794 DWARF operation expressions (see DWARF Version 5 section 2.5 and
1795 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1796 instructions (see DWARF Version 5 section 6.4 and
1797 :ref:`amdgpu-dwarf-call-frame-information`).
1798
1799 A single code object can contain code for kernels that have different wavefront
1800 sizes. The vector registers and some scalar registers are based on the wavefront
1801 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1802 simplifies the consumer of the DWARF so that each register has a fixed size,
1803 rather than being dynamic according to the wavefront size mode. Similarly,
1804 distinct DWARF registers are defined for those registers that vary in size
1805 according to the process address size. This allows a consumer to treat a
1806 specific AMDGPU processor as a single architecture regardless of how it is
1807 configured at run time. The compiler explicitly specifies the DWARF registers
1808 that match the mode in which the code it is generating will be executed.
1809
1810 DWARF registers are encoded as numbers, which are mapped to architecture
1811 registers. The mapping for AMDGPU is defined in
1812 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1813 mapping.
1814
1815 .. table:: AMDGPU DWARF Register Mapping
1816    :name: amdgpu-dwarf-register-mapping-table
1817
1818    ============== ================= ======== ==================================
1819    DWARF Register AMDGPU Register   Bit Size Description
1820    ============== ================= ======== ==================================
1821    0              PC_32             32       Program Counter (PC) when
1822                                              executing in a 32-bit process
1823                                              address space. Used in the CFI to
1824                                              describe the PC of the calling
1825                                              frame.
1826    1              EXEC_MASK_32      32       Execution Mask Register when
1827                                              executing in wavefront 32 mode.
1828    2-15           *Reserved*                 *Reserved for highly accessed
1829                                              registers using DWARF shortcut.*
1830    16             PC_64             64       Program Counter (PC) when
1831                                              executing in a 64-bit process
1832                                              address space. Used in the CFI to
1833                                              describe the PC of the calling
1834                                              frame.
1835    17             EXEC_MASK_64      64       Execution Mask Register when
1836                                              executing in wavefront 64 mode.
1837    18-31          *Reserved*                 *Reserved for highly accessed
1838                                              registers using DWARF shortcut.*
1839    32-95          SGPR0-SGPR63      32       Scalar General Purpose
1840                                              Registers.
1841    96-127         *Reserved*                 *Reserved for frequently accessed
1842                                              registers using DWARF 1-byte ULEB.*
1843    128            STATUS            32       Status Register.
1844    129-511        *Reserved*                 *Reserved for future Scalar
1845                                              Architectural Registers.*
1846    512            VCC_32            32       Vector Condition Code Register
1847                                              when executing in wavefront 32
1848                                              mode.
1849    513-767        *Reserved*                 *Reserved for future Vector
1850                                              Architectural Registers when
1851                                              executing in wavefront 32 mode.*
1852    768            VCC_64            64       Vector Condition Code Register
1853                                              when executing in wavefront 64
1854                                              mode.
1855    769-1023       *Reserved*                 *Reserved for future Vector
1856                                              Architectural Registers when
1857                                              executing in wavefront 64 mode.*
1858    1024-1087      *Reserved*                 *Reserved for padding.*
1859    1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1860    1130-1535      *Reserved*                 *Reserved for future Scalar
1861                                              General Purpose Registers.*
1862    1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1863                                              when executing in wavefront 32
1864                                              mode.
1865    1792-2047      *Reserved*                 *Reserved for future Vector
1866                                              General Purpose Registers when
1867                                              executing in wavefront 32 mode.*
1868    2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1869                                              when executing in wavefront 32
1870                                              mode.
1871    2304-2559      *Reserved*                 *Reserved for future Vector
1872                                              Accumulation Registers when
1873                                              executing in wavefront 32 mode.*
1874    2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1875                                              when executing in wavefront 64
1876                                              mode.
1877    2816-3071      *Reserved*                 *Reserved for future Vector
1878                                              General Purpose Registers when
1879                                              executing in wavefront 64 mode.*
1880    3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1881                                              when executing in wavefront 64
1882                                              mode.
1883    3328-3583      *Reserved*                 *Reserved for future Vector
1884                                              Accumulation Registers when
1885                                              executing in wavefront 64 mode.*
1886    ============== ================= ======== ==================================
1887
1888 The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
1890 the least significant bit position corresponding to lane 0 and so forth. DWARF
1891 location expressions involving the ``DW_OP_LLVM_offset`` and
1892 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1893 register corresponding to the lane that is executing the current thread of
1894 execution in languages that are implemented using a SIMD or SIMT execution
1895 model.
1896
1897 If the wavefront size is 32 lanes then the wavefront 32 mode register
1898 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1899 mode register definitions are used. Some AMDGPU targets support executing in
1900 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1901 to the wavefront mode of the generated code will be used.
1902
1903 If code is generated to execute in a 32-bit process address space, then the
1904 32-bit process address space register definitions are used. If code is generated
1905 to execute in a 64-bit process address space, then the 64-bit process address
1906 space register definitions are used. The ``amdgcn`` target only supports the
1907 64-bit process address space.
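
The general purpose register rows of
:ref:`amdgpu-dwarf-register-mapping-table` can be expressed as a small lookup,
sketched below. The function names are hypothetical and the sketch only
restates the table: a producer would pick the wavefront 32 or wavefront 64
numbers according to the mode of the code being generated.

.. code:: c

  /* Each function returns the DWARF register number, or -1 if the table
     defines no mapping for the request. */

  /* SGPR0-SGPR63 map to 32-95, SGPR64-SGPR105 map to 1088-1129. */
  static int dwarf_sgpr(unsigned n) {
    if (n <= 63)
      return 32 + (int)n;
    if (n <= 105)
      return 1088 + (int)(n - 64);
    return -1;
  }

  /* VGPR0-VGPR255: 1536-1791 in wavefront 32 mode, 2560-2815 in
     wavefront 64 mode. */
  static int dwarf_vgpr(unsigned n, unsigned wavefront_size) {
    if (n > 255)
      return -1;
    if (wavefront_size == 32)
      return 1536 + (int)n;
    if (wavefront_size == 64)
      return 2560 + (int)n;
    return -1;
  }

  /* AGPR0-AGPR255: 2048-2303 in wavefront 32 mode, 3072-3327 in
     wavefront 64 mode. */
  static int dwarf_agpr(unsigned n, unsigned wavefront_size) {
    if (n > 255)
      return -1;
    if (wavefront_size == 32)
      return 2048 + (int)n;
    if (wavefront_size == 64)
      return 3072 + (int)n;
    return -1;
  }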
1908
1909 .. _amdgpu-dwarf-memory-space-identifier:
1910
1911 Memory Space Identifier
1912 -----------------------
1913
1914 The DWARF memory space represents the source language memory space. See DWARF
1915 Version 5 section 2.12 which is updated by the *DWARF Extensions For
1916 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
1917
1918 The DWARF memory space mapping used for AMDGPU is defined in
1919 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
1920
1921 .. table:: AMDGPU DWARF Memory Space Mapping
1922    :name: amdgpu-dwarf-memory-space-mapping-table
1923
1924    =========================== ====== =================
1925    DWARF                              AMDGPU
1926    ---------------------------------- -----------------
1927    Memory Space Name           Value  Memory Space
1928    =========================== ====== =================
1929    ``DW_MSPACE_LLVM_none``     0x0000 Generic (Flat)
1930    ``DW_MSPACE_LLVM_global``   0x0001 Global
1931    ``DW_MSPACE_LLVM_constant`` 0x0002 Global
1932    ``DW_MSPACE_LLVM_group``    0x0003 Local (group/LDS)
1933    ``DW_MSPACE_LLVM_private``  0x0004 Private (Scratch)
1934    ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
1935    =========================== ====== =================
1936
1937 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
1938 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
1939
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. It
supports the AMD extension for accessing the hardware GDS memory, which is
scratchpad memory allocated per device.
1943
1944 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
1945 default memory space of ``DW_MSPACE_LLVM_none`` is used.
1946
1947 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1948 mapping of DWARF memory spaces to DWARF address spaces, including address size
1949 and NULL value.
1950
1951 .. _amdgpu-dwarf-address-space-identifier:
1952
1953 Address Space Identifier
1954 ------------------------
1955
1956 DWARF address spaces correspond to target architecture specific linear
1957 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1958 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
1959
1960 The DWARF address space mapping used for AMDGPU is defined in
1961 :ref:`amdgpu-dwarf-address-space-mapping-table`.
1962
1963 .. table:: AMDGPU DWARF Address Space Mapping
1964    :name: amdgpu-dwarf-address-space-mapping-table
1965
1966    ======================================= ===== ======= ======== ===================== =======================
1967    DWARF                                                          AMDGPU                Notes
1968    --------------------------------------- ----- ---------------- --------------------- -----------------------
1969    Address Space Name                      Value Address Bit Size LLVM IR Address Space
1970    --------------------------------------- ----- ------- -------- --------------------- -----------------------
1971    ..                                            64-bit  32-bit
1972                                                  process process
1973                                                  address address
1974                                                  space   space
1975    ======================================= ===== ======= ======== ===================== =======================
1976    ``DW_ASPACE_LLVM_none``                 0x00  64      32       Global                *default address space*
1977    ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
1978    ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
1979    ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
1980    *Reserved*                              0x04
1981    ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch)     *focused lane*
1982    ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch)     *unswizzled wavefront*
1983    ======================================= ===== ======= ======== ===================== =======================
1984
1985 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
1986 spaces including address size and NULL value.
1987
1988 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
1989 address space used in DWARF operations that do not specify an address space. It
1990 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1991 related operations can refer to addresses in the program code.
1992
1993 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1994 specify the flat address space. If the address corresponds to an address in the
1995 local address space, then it corresponds to the wavefront that is executing the
1996 focused thread of execution. If the address corresponds to an address in the
1997 private address space, then it corresponds to the lane that is executing the
1998 focused thread of execution for languages that are implemented using a SIMD or
1999 SIMT execution model.
2000
2001 .. note::
2002
2003   CUDA-like languages such as HIP that do not have address spaces in the
2004   language type system, but do allow variables to be allocated in different
2005   address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2006   address space in the DWARF expression operations as the default address space
2007   is the global address space.
2008
2009 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2010 specify the local address space corresponding to the wavefront that is executing
2011 the focused thread of execution.
2012
2013 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2014 to specify the private address space corresponding to the lane that is executing
2015 the focused thread of execution for languages that are implemented using a SIMD
2016 or SIMT execution model.
2017
2018 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2019 to specify the unswizzled private address space corresponding to the wavefront
2020 that is executing the focused thread of execution. The wavefront view of private
2021 memory is the per wavefront unswizzled backing memory layout defined in
2022 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2023 location for the backing memory of the wavefront (namely the address is not
2024 offset by ``wavefront-scratch-base``). The following formula can be used to
2025 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2026 ``DW_ASPACE_AMDGPU_private_wave`` address:
2027
2028 ::
2029
2030   private-address-wavefront =
2031     ((private-address-lane / 4) * wavefront-size * 4) +
2032     (wavefront-lane-id * 4) + (private-address-lane % 4)
2033
2034 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
2035 of the dwords for each lane starting with lane 0 is required, then this
2036 simplifies to:
2037
2038 ::
2039
2040   private-address-wavefront =
2041     private-address-lane * wavefront-size
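
Both formulas can be checked with a small C sketch. The function name
``private_lane_to_wave`` is hypothetical, and 64-bit unsigned integers are
used here only so that the result cannot overflow a 32-bit lane address.

.. code:: c

  #include <stdint.h>

  /* Converts a DW_ASPACE_AMDGPU_private_lane address to the corresponding
     DW_ASPACE_AMDGPU_private_wave address for the given lane, using the
     general formula above. */
  static uint64_t private_lane_to_wave(uint64_t private_address_lane,
                                       uint64_t wavefront_size,
                                       uint64_t wavefront_lane_id) {
    return ((private_address_lane / 4) * wavefront_size * 4) +
           (wavefront_lane_id * 4) + (private_address_lane % 4);
  }

For a dword aligned lane address of 8 in wavefront 64 mode and lane 0, the
general formula gives ``2 * 64 * 4 = 512``, which matches the simplified form
``8 * 64``.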
2042
2043 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2044 complete spilled vector register back into a complete vector register in the
2045 CFI. The frame pointer can be a private lane address which is dword aligned,
2046 which can be shifted to multiply by the wavefront size, and then used to form a
2047 private wavefront address that gives a location for a contiguous set of dwords,
2048 one per lane, where the vector register dwords are spilled. The compiler knows
2049 the wavefront size since it generates the code. Note that the type of the
2050 address may have to be converted as the size of a
2051 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2052 ``DW_ASPACE_AMDGPU_private_wave`` address.
2053
2054 .. _amdgpu-dwarf-lane-identifier:
2055
Lane Identifier
2057 ---------------
2058
DWARF lane identifiers specify a target architecture lane position for hardware
2060 that executes in a SIMD or SIMT manner, and on which a source language maps its
2061 threads of execution onto those lanes. The DWARF lane identifier is pushed by
2062 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2063 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2064 section :ref:`amdgpu-dwarf-operation-expressions`.
2065
2066 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2067 wavefront. It is numbered from 0 to the wavefront size minus 1.
2068
2069 Operation Expressions
2070 ---------------------
2071
2072 DWARF expressions are used to compute program values and the locations of
2073 program objects. See DWARF Version 5 section 2.5 and
2074 :ref:`amdgpu-dwarf-operation-expressions`.
2075
2076 DWARF location descriptions describe how to access storage which includes memory
2077 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2078 significant bytes first, and bits are ordered within bytes with least
2079 significant bits first.
2080
2081 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2082 unwinding vector registers that are spilled under the execution mask to memory:
2083 the zero-single location description is the vector register, and the one-single
2084 location description is the spilled memory location description. The
2085 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2086 memory location description.
2087
2088 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2089 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2090 controlled by the execution mask. An undefined location description together
2091 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2092 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2093
2094 Debugger Information Entry Attributes
2095 -------------------------------------
2096
2097 This section describes how certain debugger information entry attributes are
2098 used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1
2099 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2100 :ref:`amdgpu-dwarf-low-level-information` and
2101 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2102
2103 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2104
2105 ``DW_AT_LLVM_lane_pc``
2106 ~~~~~~~~~~~~~~~~~~~~~~
2107
2108 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2109 location of the separate lanes of a SIMT thread.
2110
2111 If the lane is an active lane then this will be the same as the current program
2112 location.
2113
2114 If the lane is inactive, but was active on entry to the subprogram, then this is
2115 the program location in the subprogram at which execution of the lane is
2116 conceptually positioned.
2117
2118 If the lane was not active on entry to the subprogram, then this will be the
2119 undefined location. A client debugger can check if the lane is part of a valid
2120 work-group by checking that the lane is in the range of the associated
2121 work-group within the grid, accounting for partial work-groups. If it is not,
2122 then the debugger can omit any information for the lane. Otherwise, the debugger
2123 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2124 calling subprogram until it finds a non-undefined location. Conceptually the
2125 lane only has the call frames for which it has a non-undefined
2126 ``DW_AT_LLVM_lane_pc``.
2127
2128 The following example illustrates how the AMDGPU backend can generate a DWARF
2129 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2130 following subprogram pseudo code for a target with 64 lanes per wavefront.
2131
2132 .. code::
2133   :number-lines:
2134
2135   SUBPROGRAM X
2136   BEGIN
2137     a;
2138     IF (c1) THEN
2139       b;
2140       IF (c2) THEN
2141         c;
2142       ELSE
2143         d;
2144       ENDIF
2145       e;
2146     ELSE
2147       f;
2148     ENDIF
2149     g;
2150   END
2151
2152 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2153 execution mask (``EXEC``) to linearize the control flow. The condition is
2154 evaluated to make a mask of the lanes for which the condition evaluates to true.
2155 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2156 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
2157 ``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
2158 of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
2159 region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
2160 value it had at the beginning of the region. This is shown below. Other
2161 approaches are possible, but the basic concept is the same.
2162
2163 .. code::
2164   :number-lines:
2165
2166   $lex_start:
2167     a;
2168     %1 = EXEC
2169     %2 = c1
2170   $lex_1_start:
2171     EXEC = %1 & %2
2172   $lex_1_then:
2173       b;
2174       %3 = EXEC
2175       %4 = c2
2176   $lex_1_1_start:
2177       EXEC = %3 & %4
2178   $lex_1_1_then:
2179         c;
2180       EXEC = ~EXEC & %3
2181   $lex_1_1_else:
2182         d;
2183       EXEC = %3
2184   $lex_1_1_end:
2185       e;
2186     EXEC = ~EXEC & %1
2187   $lex_1_else:
2188       f;
2189     EXEC = %1
2190   $lex_1_end:
2191     g;
2192   $lex_end:
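
The mask manipulation above can be traced with a small Python model. This is a sketch under stated assumptions: the function name ``run_linearized`` is hypothetical, masks are plain integers with one bit per lane, and the usage below uses 4 lanes for brevity rather than the 64 lanes of the example.

```python
def run_linearized(exec_mask: int, c1: int, c2: int) -> dict:
    """Return, for each statement a..g, the mask of lanes that execute it.

    c1 and c2 are the condition masks (bit set where the condition is true).
    """
    executed = {}
    executed["a"] = exec_mask
    saved_1 = exec_mask                  # %1 = EXEC
    exec_mask = saved_1 & c1             # EXEC = %1 & %2
    executed["b"] = exec_mask
    saved_1_1 = exec_mask                # %3 = EXEC
    exec_mask = saved_1_1 & c2           # EXEC = %3 & %4
    executed["c"] = exec_mask
    exec_mask = ~exec_mask & saved_1_1   # EXEC = ~EXEC & %3
    executed["d"] = exec_mask
    exec_mask = saved_1_1                # EXEC = %3
    executed["e"] = exec_mask
    exec_mask = ~exec_mask & saved_1     # EXEC = ~EXEC & %1
    executed["f"] = exec_mask
    exec_mask = saved_1                  # EXEC = %1
    executed["g"] = exec_mask
    return executed
```

With ``run_linearized(0b1111, 0b0011, 0b0101)``, lanes 0 and 1 execute ``b`` and ``e``, lane 0 alone executes ``c``, lane 1 alone executes ``d``, lanes 2 and 3 execute ``f``, and all four lanes execute ``a`` and ``g``.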
2193
2194 To create the DWARF location list expression that defines the location
2195 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2196 pseudo instruction can be used to annotate the linearized control flow. This can
2197 be done by defining an artificial variable for the lane PC. The DWARF location
2198 list expression created for it is used as the value of the
2199 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2200
2201 A DWARF procedure is defined for each well nested structured control flow region
2202 which provides the conceptual lane program location for a lane if it is not
2203 active (namely it is divergent). The DWARF operation expression for each region
2204 conceptually inherits the value of the immediately enclosing region and modifies
2205 it according to the semantics of the region.
2206
2207 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2208 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2209 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2210 region since the ``THEN`` region has completed.
2211
2212 The lane PC artificial variable is assigned at each region transition. It uses
2213 the immediately enclosing region's DWARF procedure to compute the program
2214 location for each lane assuming they are divergent, and then modifies the result
2215 by inserting the current program location for each lane that the ``EXEC`` mask
2216 indicates is active.
2217
2218 By having separate DWARF procedures for each region, they can be reused to
2219 define the value for any nested region. This reduces the total size of the DWARF
2220 operation expressions.
2221
2222 The following provides an example using pseudo LLVM MIR.
2223
2224 .. code::
2225   :number-lines:
2226
2227   $lex_start:
2228     DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2229       DW_AT_name = "__uint64";
2230       DW_AT_byte_size = 8;
2231       DW_AT_encoding = DW_ATE_unsigned;
2232     ];
2233     DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2234       DW_AT_name = "__active_lane_pc";
2235       DW_AT_location = [
2236         DW_OP_regx PC;
2237         DW_OP_LLVM_extend 64, 64;
2238         DW_OP_regval_type EXEC, %__uint_64;
2239         DW_OP_LLVM_select_bit_piece 64, 64;
2240       ];
2241     ];
2242     DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2243       DW_AT_name = "__divergent_lane_pc";
2244       DW_AT_location = [
2245         DW_OP_LLVM_undefined;
2246         DW_OP_LLVM_extend 64, 64;
2247       ];
2248     ];
2249     DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2250       DW_OP_call_ref %__divergent_lane_pc;
2251       DW_OP_call_ref %__active_lane_pc;
2252     ];
2253     a;
2254     %1 = EXEC;
2255     DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2256     %2 = c1;
2257   $lex_1_start:
2258     EXEC = %1 & %2;
2259   $lex_1_then:
2260       DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2261         DW_AT_name = "__divergent_lane_pc_1_then";
2262         DW_AT_location = DIExpression[
2263           DW_OP_call_ref %__divergent_lane_pc;
2264           DW_OP_addrx &lex_1_start;
2265           DW_OP_stack_value;
2266           DW_OP_LLVM_extend 64, 64;
2267           DW_OP_call_ref %__lex_1_save_exec;
2268           DW_OP_deref_type 64, %__uint_64;
2269           DW_OP_LLVM_select_bit_piece 64, 64;
2270         ];
2271       ];
2272       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2273         DW_OP_call_ref %__divergent_lane_pc_1_then;
2274         DW_OP_call_ref %__active_lane_pc;
2275       ];
2276       b;
2277       %3 = EXEC;
2278       DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2279       %4 = c2;
2280   $lex_1_1_start:
2281       EXEC = %3 & %4;
2282   $lex_1_1_then:
2283         DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2284           DW_AT_name = "__divergent_lane_pc_1_1_then";
2285           DW_AT_location = DIExpression[
2286             DW_OP_call_ref %__divergent_lane_pc_1_then;
2287             DW_OP_addrx &lex_1_1_start;
2288             DW_OP_stack_value;
2289             DW_OP_LLVM_extend 64, 64;
2290             DW_OP_call_ref %__lex_1_1_save_exec;
2291             DW_OP_deref_type 64, %__uint_64;
2292             DW_OP_LLVM_select_bit_piece 64, 64;
2293           ];
2294         ];
2295         DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2296           DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2297           DW_OP_call_ref %__active_lane_pc;
2298         ];
2299         c;
2300       EXEC = ~EXEC & %3;
2301   $lex_1_1_else:
2302         DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2303           DW_AT_name = "__divergent_lane_pc_1_1_else";
2304           DW_AT_location = DIExpression[
2305             DW_OP_call_ref %__divergent_lane_pc_1_then;
2306             DW_OP_addrx &lex_1_1_end;
2307             DW_OP_stack_value;
2308             DW_OP_LLVM_extend 64, 64;
2309             DW_OP_call_ref %__lex_1_1_save_exec;
2310             DW_OP_deref_type 64, %__uint_64;
2311             DW_OP_LLVM_select_bit_piece 64, 64;
2312           ];
2313         ];
2314         DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2315           DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2316           DW_OP_call_ref %__active_lane_pc;
2317         ];
2318         d;
2319       EXEC = %3;
2320   $lex_1_1_end:
2321       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2322         DW_OP_call_ref %__divergent_lane_pc_1_then;
2323         DW_OP_call_ref %__active_lane_pc;
2324       ];
2325       e;
2326     EXEC = ~EXEC & %1;
2327   $lex_1_else:
2328       DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2329         DW_AT_name = "__divergent_lane_pc_1_else";
2330         DW_AT_location = DIExpression[
2331           DW_OP_call_ref %__divergent_lane_pc;
2332           DW_OP_addrx &lex_1_end;
2333           DW_OP_stack_value;
2334           DW_OP_LLVM_extend 64, 64;
2335           DW_OP_call_ref %__lex_1_save_exec;
2336           DW_OP_deref_type 64, %__uint_64;
2337           DW_OP_LLVM_select_bit_piece 64, 64;
2338         ];
2339       ];
2340       DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2341         DW_OP_call_ref %__divergent_lane_pc_1_else;
2342         DW_OP_call_ref %__active_lane_pc;
2343       ];
2344       f;
2345     EXEC = %1;
2346   $lex_1_end:
2347     DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2348       DW_OP_call_ref %__divergent_lane_pc;
2349       DW_OP_call_ref %__active_lane_pc;
2350     ];
2351     g;
2352   $lex_end:
2353
2354 The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
2355 that are active, with the current program location.
2356
2357 Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
2358 created for the execution masks saved on entry to a region. Using the
2359 ``DBG_VALUE`` pseudo instruction, location list entries will be created that
2360 describe where the artificial variables are allocated at any given program
2361 location. The compiler may allocate them to registers or spill them to memory.
2362
2363 The DWARF procedures for each region use the values of the saved execution mask
2364 artificial variables to only update the lanes that are active on entry to the
2365 region. All other lanes retain the value of the enclosing region where they were
2366 last active. If they were not active on entry to the subprogram, they will have
2367 the undefined location description.
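
The way the region procedures compose can be modelled in Python. This is an illustrative sketch, not the DWARF evaluation itself: lane PCs are modelled as a list with ``None`` standing for the undefined location, and the helper names ``extend``, ``select_bit_piece`` and ``lane_pcs_in_inner_then`` are hypothetical stand-ins for the operations and procedures named in the example.

```python
from typing import List, Optional

LanePCs = List[Optional[int]]  # None models the undefined location


def extend(value: Optional[int], lanes: int) -> LanePCs:
    """Model DW_OP_LLVM_extend: replicate one value across all lanes."""
    return [value] * lanes


def select_bit_piece(zero: LanePCs, one: LanePCs, mask: int) -> LanePCs:
    """Model DW_OP_LLVM_select_bit_piece: per lane, take `one` where the
    mask bit is set, otherwise keep `zero`."""
    return [one[i] if (mask >> i) & 1 else zero[i] for i in range(len(zero))]


def lane_pcs_in_inner_then(pc: int, exec_mask: int,
                           save_exec_1: int, save_exec_1_1: int,
                           lex_1_start: int, lex_1_1_start: int,
                           lanes: int) -> LanePCs:
    """Lane PCs while executing statement `c` of the example."""
    # %__divergent_lane_pc: every lane starts out undefined.
    divergent = extend(None, lanes)
    # %__divergent_lane_pc_1_then: lanes active on entry to region 1.
    div_1_then = select_bit_piece(
        divergent, extend(lex_1_start, lanes), save_exec_1)
    # %__divergent_lane_pc_1_1_then: lanes active on entry to region 1_1.
    div_1_1_then = select_bit_piece(
        div_1_then, extend(lex_1_1_start, lanes), save_exec_1_1)
    # %__active_lane_pc: currently active lanes get the current PC.
    return select_bit_piece(div_1_1_then, extend(pc, lanes), exec_mask)
```

Each nested procedure only overwrites the lanes that were active on entry to its region, so lanes that diverged at an outer region keep the location assigned by that outer region.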
2368
2369 Other structured control flow regions can be handled similarly. For example,
2370 loops would set the divergent program location for the region at the end of the
2371 loop. Any lanes active will be in the loop, and any lanes not active must have
2372 exited the loop.
2373
2374 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2375 ``IF/THEN/ELSE`` regions.
2376
2377 The DWARF procedures can use the active lane artificial variable described in
2378 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2379 ``EXEC`` mask in order to support whole or quad wavefront mode.
2380
2381 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2382
2383 ``DW_AT_LLVM_active_lane``
2384 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2385
2386 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2387 entry is used to specify the lanes that are conceptually active for a SIMT
2388 thread.
2389
2390 The execution mask may be modified to implement whole or quad wavefront mode
2391 operations. For example, all lanes may need to temporarily be made active to
2392 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2393 update it to enable the necessary lanes, perform the operations, and then
2394 restore the ``EXEC`` mask from the saved value. While executing the whole
2395 wavefront region, the conceptual execution mask is the saved value, not the
2396 ``EXEC`` value.
2397
2398 This is handled by defining an artificial variable for the active lane mask. The
2399 active lane mask artificial variable would be the actual ``EXEC`` mask for
2400 normal regions, and the saved execution mask for regions where the mask is
2401 temporarily updated. The location list expression created for this artificial
2402 variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2403 attribute.
2404
2405 ``DW_AT_LLVM_augmentation``
2406 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2407
2408 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2409 debugger information entry has the following value for the augmentation string:
2410
2411 ::
2412
2413   [amdgpu:v0.0]
2414
2415 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2416 extensions used in the DWARF of the compilation unit. The version number
2417 conforms to [SEMVER]_.
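
A consumer might extract the vendor and version from such an augmentation string as sketched below. The function name ``parse_augmentation`` and the regular expression are assumptions for illustration; they accept one or more ``[vendor:vX.Y]`` groups and are not taken from any LLVM API.

```python
import re

# One bracketed group of the form [vendor:vX.Y].
_AUGMENTATION_GROUP = re.compile(r"\[([^:\]]+):v(\d+)\.(\d+)\]")


def parse_augmentation(text: str) -> list:
    """Return a list of (vendor, major, minor) tuples found in the string."""
    return [(vendor, int(major), int(minor))
            for vendor, major, minor in _AUGMENTATION_GROUP.findall(text)]
```

For the value shown above, ``parse_augmentation("[amdgpu:v0.0]")`` yields a single entry with vendor ``amdgpu``, major version 0 and minor version 0.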
2418
2419 Call Frame Information
2420 ----------------------
2421
2422 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2423 *unwind* call frames in a running process or core dump. See DWARF Version 5
2424 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2425
2426 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2427
2428 1.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2429
2430     ::
2431
2432       [amd:v0.0]
2433
2434     The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
2435     extensions used in this CIE or in the FDEs that use it. The version number
2436     conforms to [SEMVER]_.
2437
2438 2.  ``address_size`` for the ``Global`` address space is defined in
2439     :ref:`amdgpu-dwarf-address-space-identifier`.
2440
2441 3.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2442
2443 4.  ``code_alignment_factor`` is 4 bytes.
2444
2445     .. TODO::
2446
2447        Add to :ref:`amdgpu-processor-table` table.
2448
2449 5.  ``data_alignment_factor`` is 4 bytes.
2450
2451     .. TODO::
2452
2453        Add to :ref:`amdgpu-processor-table` table.
2454
2455 6.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2456     for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2457
2458 7.  ``initial_instructions`` Since a subprogram X with fewer registers can be
2459     called from subprogram Y that has more allocated, X will not change any of
2460     the extra registers as it cannot access them. Therefore, the default rule
2461     for all columns is ``same value``.
2462
2463 For AMDGPU the register number follows the numbering defined in
2464 :ref:`amdgpu-dwarf-register-identifier`.
2465
2466 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2467 the return address to get the address of a byte within the call site
2468 instructions. See DWARF Version 5 section 6.4.4.
2469
2470 Accelerated Access
2471 ------------------
2472
2473 See DWARF Version 5 section 6.1.
2474
2475 Lookup By Name Section Header
2476 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2477
2478 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2479
2480 For AMDGPU the lookup by name section header fields have the following values:
2481
2482 ``augmentation_string_size`` (uword)
2483
2484   Set to the length of the ``augmentation_string`` value which is always a
2485   multiple of 4.
2486
2487 ``augmentation_string`` (sequence of UTF-8 characters)
2488
2489   Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2490
2491   ::
2492
2493     [amdgpu:v0.0]
2494
2495   The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2496   extensions used in the DWARF of this index. The version number conforms to
2497   [SEMVER]_.
2498
2499   .. note::
2500
2501     This is different to the DWARF Version 5 definition that requires the first
2502     4 characters to be the vendor ID. But this is consistent with the other
2503     augmentation strings and does allow multiple vendor contributions. However,
2504     backwards compatibility may be more desirable.
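
The padding rule can be illustrated with a short Python sketch. The helper name ``pad_augmentation_string`` is hypothetical; it only shows how the two header fields relate.

```python
def pad_augmentation_string(text: str) -> bytes:
    """Encode as UTF-8 and null pad to a multiple of 4 bytes."""
    raw = text.encode("utf-8")
    padding = -len(raw) % 4  # bytes needed to reach the next multiple of 4
    return raw + b"\x00" * padding


augmentation_string = pad_augmentation_string("[amdgpu:v0.0]")
augmentation_string_size = len(augmentation_string)
```

The string ``[amdgpu:v0.0]`` is 13 bytes long, so three null bytes are appended and ``augmentation_string_size`` is 16.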
2505
2506 Lookup By Address Section Header
2507 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2508
2509 See DWARF Version 5 section 6.1.2.
2510
2511 For AMDGPU the lookup by address section header fields have the following values:
2512
2513 ``address_size`` (ubyte)
2514
2515   Match the address size for the ``Global`` address space defined in
2516   :ref:`amdgpu-dwarf-address-space-identifier`.
2517
2518 ``segment_selector_size`` (ubyte)
2519
2520   AMDGPU does not use a segment selector so this is 0. The entries in the
2521   ``.debug_aranges`` do not have a segment selector.
2522
2523 Line Number Information
2524 -----------------------
2525
2526 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2527
2528 AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2529 The instruction set must be obtained from the ELF file header ``e_flags`` field
2530 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2531 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
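
A consumer could read the machine field from ``e_flags`` as sketched below. The mask value ``0x0ff`` for ``EF_AMDGPU_MACH`` is an assumption here and should be checked against the ELF Header section; the function name ``amdgpu_mach`` is hypothetical.

```python
# Assumed mask for the EF_AMDGPU_MACH field; verify against the ELF Header
# section before relying on it.
EF_AMDGPU_MACH = 0x0FF


def amdgpu_mach(e_flags: int) -> int:
    """Return the EF_AMDGPU_MACH field of an ELF header e_flags value."""
    return e_flags & EF_AMDGPU_MACH
```

The returned value is then looked up in the processor table referenced by the ELF Header section to identify the instruction set.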
2532
2533 .. TODO::
2534
2535   Should the ``isa`` state machine register be used to indicate if the code is
2536   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2537
2538 For AMDGPU the line number program header fields have the following values (see
2539 DWARF Version 5 section 6.2.4):
2540
2541 ``address_size`` (ubyte)
2542   Matches the address size for the ``Global`` address space defined in
2543   :ref:`amdgpu-dwarf-address-space-identifier`.
2544
2545 ``segment_selector_size`` (ubyte)
2546   AMDGPU does not use a segment selector so this is 0.
2547
2548 ``minimum_instruction_length`` (ubyte)
2549   For GFX9-GFX11 this is 4.
2550
2551 ``maximum_operations_per_instruction`` (ubyte)
2552   For GFX9-GFX11 this is 1.
2553
2554 Source text for online-compiled programs (for example, those compiled by the
2555 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2556 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2557 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2558 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2559
2560 The Clang option used to control source embedding in AMDGPU is defined in
2561 :ref:`amdgpu-clang-debug-options-table`.
2562
2563   .. table:: AMDGPU Clang Debug Options
2564      :name: amdgpu-clang-debug-options-table
2565
2566      ==================== ==================================================
2567      Debug Flag           Description
2568      ==================== ==================================================
2569      -g[no-]embed-source  Enable/disable embedding source text in DWARF
2570                           debug sections. Useful for environments where
2571                           source cannot be written to disk, such as
2572                           when performing online compilation.
2573      ==================== ==================================================
2574
2575 For example:
2576
2577 ``-gembed-source``
2578   Enable the embedded source.
2579
2580 ``-gno-embed-source``
2581   Disable the embedded source.
2582
2583 32-Bit and 64-Bit DWARF Formats
2584 -------------------------------
2585
2586 See DWARF Version 5 section 7.4 and
2587 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2588
2589 For AMDGPU:
2590
2591 * For the ``amdgcn`` target architecture only the 64-bit process address space
2592   is supported.
2593
2594 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2595   the 32-bit DWARF format.
2596
2597 Unit Headers
2598 ------------
2599
2600 For AMDGPU the following values apply for each of the unit headers described in
2601 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2602
2603 ``address_size`` (ubyte)
2604   Matches the address size for the ``Global`` address space defined in
2605   :ref:`amdgpu-dwarf-address-space-identifier`.
2606
2607 .. _amdgpu-code-conventions:
2608
2609 Code Conventions
2610 ================
2611
2612 This section provides code conventions used for each supported target triple OS
2613 (see :ref:`amdgpu-target-triples`).
2614
2615 AMDHSA
2616 ------
2617
2618 This section provides code conventions used when the target triple OS is
2619 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2620
2621 .. _amdgpu-amdhsa-code-object-metadata:
2622
2623 Code Object Metadata
2624 ~~~~~~~~~~~~~~~~~~~~
2625
2626 The code object metadata specifies extensible metadata associated with the code
2627 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
2628 encoding and semantics of this metadata depends on the code object version; see
2629 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2630 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2631 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2632 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2633
2634 Code object metadata is specified in a note record (see
2635 :ref:`amdgpu-note-records`) and is required when the target triple OS is
2636 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2637 information necessary to support the HSA compatible runtime kernel queries. For
2638 example, the segment sizes needed in a dispatch packet. In addition, a
2639 high-level language runtime may require other information to be included. For
2640 example, the AMD OpenCL runtime records kernel argument information.
2641
2642 .. _amdgpu-amdhsa-code-object-metadata-v2:
2643
2644 Code Object V2 Metadata
2645 +++++++++++++++++++++++
2646
2647 .. warning::
2648   Code object V2 is not the default code object version emitted by this version
2649   of LLVM.
2650
2651 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2652 (see :ref:`amdgpu-note-records-v2`).
2653
2654 The metadata is specified as a YAML formatted string (see [YAML]_ and
2655 :doc:`YamlIO`).
2656
2657 .. TODO::
2658
2659   Is the string null terminated? It probably should not if YAML allows it to
2660   contain null characters, otherwise it should be.
2661
2662 The metadata is represented as a single YAML document comprised of the mapping
2663 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2664 referenced tables.
2665
2666 For boolean values, the string values of ``false`` and ``true`` are used for
2667 false and true respectively.
2668
2669 Additional information can be added to the mappings. To avoid conflicts, any
2670 non-AMD key names should be prefixed by "*vendor-name*.".
2671
2672   .. table:: AMDHSA Code Object V2 Metadata Map
2673      :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2674
2675      ========== ============== ========= =======================================
2676      String Key Value Type     Required? Description
2677      ========== ============== ========= =======================================
2678      "Version"  sequence of    Required  - The first integer is the major
2679                 2 integers                 version. Currently 1.
2680                                          - The second integer is the minor
2681                                            version. Currently 0.
2682      "Printf"   sequence of              Each string is encoded information
2683                 strings                  about a printf function call. The
2684                                          encoded information is organized as
2685                                          fields separated by colon (':'):
2686
2687                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2688
2689                                          where:
2690
2691                                          ``ID``
2692                                            A 32-bit integer as a unique id for
2693                                            each printf function call
2694
2695                                          ``N``
2696                                            A 32-bit integer equal to the number
2697                                            of arguments of printf function call
2698                                            minus 1
2699
2700                                          ``S[i]`` (where i = 0, 1, ... , N-1)
2701                                            32-bit integers for the size in bytes
2702                                            of the i-th FormatString argument of
2703                                            the printf function call
2704
2705                                          ``FormatString``
2706                                            The format string passed to the
2707                                            printf function call.
2708      "Kernels"  sequence of    Required  Sequence of the mappings for each
2709                 mapping                  kernel in the code object. See
2710                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2711                                          for the definition of the mapping.
2712      ========== ============== ========= =======================================
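
The ``"Printf"`` string encoding described in the table can be decoded as sketched below. This is a minimal illustration: the names ``PrintfRecord`` and ``parse_printf_record`` are hypothetical, and the sketch assumes that only the format string, being the last field, may itself contain colons.

```python
from typing import List, NamedTuple


class PrintfRecord(NamedTuple):
    printf_id: int
    arg_sizes: List[int]
    format_string: str


def parse_printf_record(record: str) -> PrintfRecord:
    """Decode ``ID:N:S[0]:...:S[N-1]:FormatString``."""
    # Split off ID and N first; everything else is kept intact for now.
    id_text, count_text, rest = record.split(":", 2)
    count = int(count_text)
    # Split off exactly N size fields so colons inside the format string
    # are preserved.
    fields = rest.split(":", count)
    if len(fields) != count + 1:
        raise ValueError("printf record has fewer fields than N requires")
    sizes = [int(field) for field in fields[:count]]
    return PrintfRecord(int(id_text), sizes, fields[count])
```

For example, ``parse_printf_record("1:2:4:8:x=%d y=%s: done")`` gives ID 1, argument sizes 4 and 8, and the format string ``x=%d y=%s: done``.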
2713
2714 ..
2715
2716   .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2717      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2718
2719      ================= ============== ========= ================================
2720      String Key        Value Type     Required? Description
2721      ================= ============== ========= ================================
2722      "Name"            string         Required  Source name of the kernel.
2723      "SymbolName"      string         Required  Name of the kernel
2724                                                 descriptor ELF symbol.
2725      "Language"        string                   Source language of the kernel.
2726                                                 Values include:
2727
2728                                                 - "OpenCL C"
2729                                                 - "OpenCL C++"
2730                                                 - "HCC"
2731                                                 - "OpenMP"
2732
2733      "LanguageVersion" sequence of              - The first integer is the major
2734                        2 integers                 version.
2735                                                 - The second integer is the
2736                                                   minor version.
2737      "Attrs"           mapping                  Mapping of kernel attributes.
2738                                                 See
2739                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2740                                                 for the mapping definition.
2741      "Args"            sequence of              Sequence of mappings of the
2742                        mapping                  kernel arguments. See
2743                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2744                                                 for the definition of the mapping.
2745      "CodeProps"       mapping                  Mapping of properties related to
2746                                                 the kernel code. See
2747                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2748                                                 for the mapping definition.
2749      ================= ============== ========= ================================
2750
2751 ..
2752
2753   .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2754      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2755
2756      =================== ============== ========= ==============================
2757      String Key          Value Type     Required? Description
2758      =================== ============== ========= ==============================
2759      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2760                          3 integers               must be >=1 and the dispatch
2761                                                   work-group size X, Y, Z must
2762                                                   correspond to the specified
2763                                                   values. Defaults to 0, 0, 0.
2764
2765                                                   Corresponds to the OpenCL
2766                                                   ``reqd_work_group_size``
2767                                                   attribute.
2768      "WorkGroupSizeHint" sequence of              The dispatch work-group size
2769                          3 integers               X, Y, Z is likely to be the
2770                                                   specified values.
2771
2772                                                   Corresponds to the OpenCL
2773                                                   ``work_group_size_hint``
2774                                                   attribute.
2775      "VecTypeHint"       string                   The name of a scalar or vector
2776                                                   type.
2777
2778                                                   Corresponds to the OpenCL
2779                                                   ``vec_type_hint`` attribute.
2780
2781      "RuntimeHandle"     string                   The external symbol name
2782                                                   associated with a kernel.
2783                                                   The OpenCL runtime allocates a
2784                                                   global buffer for the symbol
2785                                                   and saves the kernel's address
2786                                                   to it, which is used for
2787                                                   device side enqueueing. Only
2788                                                   available for device side
2789                                                   enqueued kernels.
2790      =================== ============== ========= ==============================
2791
2792 ..
2793
2794   .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2795      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2796
2797      ================= ============== ========= ================================
2798      String Key        Value Type     Required? Description
2799      ================= ============== ========= ================================
2800      "Name"            string                   Kernel argument name.
2801      "TypeName"        string                   Kernel argument type name.
2802      "Size"            integer        Required  Kernel argument size in bytes.
2803      "Align"           integer        Required  Kernel argument alignment in
2804                                                 bytes. Must be a power of two.
2805      "ValueKind"       string         Required  Kernel argument kind that
2806                                                 specifies how to set up the
2807                                                 corresponding argument.
2808                                                 Values include:
2809
2810                                                 "ByValue"
2811                                                   The argument is copied
2812                                                   directly into the kernarg.
2813
2814                                                 "GlobalBuffer"
2815                                                   A global address space pointer
2816                                                   to the buffer data is passed
2817                                                   in the kernarg.
2818
2819                                                 "DynamicSharedPointer"
2820                                                   A group address space pointer
2821                                                   to dynamically allocated LDS
2822                                                   is passed in the kernarg.
2823
2824                                                 "Sampler"
2825                                                   A global address space
2826                                                   pointer to a S# is passed in
2827                                                   the kernarg.
2828
2829                                                 "Image"
2830                                                   A global address space
2831                                                   pointer to a T# is passed in
2832                                                   the kernarg.
2833
2834                                                 "Pipe"
2835                                                   A global address space pointer
2836                                                   to an OpenCL pipe is passed in
2837                                                   the kernarg.
2838
2839                                                 "Queue"
2840                                                   A global address space pointer
2841                                                   to an OpenCL device enqueue
2842                                                   queue is passed in the
2843                                                   kernarg.
2844
2845                                                 "HiddenGlobalOffsetX"
2846                                                   The OpenCL grid dispatch
2847                                                   global offset for the X
2848                                                   dimension is passed in the
2849                                                   kernarg.
2850
2851                                                 "HiddenGlobalOffsetY"
2852                                                   The OpenCL grid dispatch
2853                                                   global offset for the Y
2854                                                   dimension is passed in the
2855                                                   kernarg.
2856
2857                                                 "HiddenGlobalOffsetZ"
2858                                                   The OpenCL grid dispatch
2859                                                   global offset for the Z
2860                                                   dimension is passed in the
2861                                                   kernarg.
2862
2863                                                 "HiddenNone"
2864                                                   An argument that is not used
2865                                                   by the kernel. Space needs to
2866                                                   be left for it, but it does
2867                                                   not need to be set up.
2868
2869                                                 "HiddenPrintfBuffer"
2870                                                   A global address space pointer
2871                                                   to the runtime printf buffer
2872                                                   is passed in kernarg. Mutually
2873                                                   exclusive with
2874                                                   "HiddenHostcallBuffer".
2875
2876                                                 "HiddenHostcallBuffer"
2877                                                   A global address space pointer
2878                                                   to the runtime hostcall buffer
2879                                                   is passed in kernarg. Mutually
2880                                                   exclusive with
2881                                                   "HiddenPrintfBuffer".
2882
2883                                                 "HiddenDefaultQueue"
2884                                                   A global address space pointer
2885                                                   to the OpenCL device enqueue
2886                                                   queue that should be used by
2887                                                   the kernel by default is
2888                                                   passed in the kernarg.
2889
2890                                                 "HiddenCompletionAction"
2891                                                   A global address space pointer
2892                                                   to help link enqueued kernels into
2893                                                   the ancestor tree for determining
2894                                                   when the parent kernel has finished.
2895
2896                                                 "HiddenMultiGridSyncArg"
2897                                                   A global address space pointer for
2898                                                   multi-grid synchronization is
2899                                                   passed in the kernarg.
2900
2901      "ValueType"       string                   Unused and deprecated. This should no longer
2902                                                 be emitted, but is accepted for compatibility.
2903
2904
2905      "PointeeAlign"    integer                  Alignment in bytes of pointee
2906                                                 type for pointer type kernel
2907                                                 argument. Must be a power
2908                                                 of 2. Only present if
2909                                                 "ValueKind" is
2910                                                 "DynamicSharedPointer".
2911      "AddrSpaceQual"   string                   Kernel argument address space
2912                                                 qualifier. Only present if
2913                                                 "ValueKind" is "GlobalBuffer" or
2914                                                 "DynamicSharedPointer". Values
2915                                                 are:
2916
2917                                                 - "Private"
2918                                                 - "Global"
2919                                                 - "Constant"
2920                                                 - "Local"
2921                                                 - "Generic"
2922                                                 - "Region"
2923
2924                                                 .. TODO::
2925
2926                                                    Is GlobalBuffer only Global
2927                                                    or Constant? Is
2928                                                    DynamicSharedPointer always
2929                                                    Local? Can HCC allow Generic?
2930                                                    How can Private or Region
2931                                                    ever happen?
2932
2933      "AccQual"         string                   Kernel argument access
2934                                                 qualifier. Only present if
2935                                                 "ValueKind" is "Image" or
2936                                                 "Pipe". Values
2937                                                 are:
2938
2939                                                 - "ReadOnly"
2940                                                 - "WriteOnly"
2941                                                 - "ReadWrite"
2942
2943                                                 .. TODO::
2944
2945                                                    Does this apply to
2946                                                    GlobalBuffer?
2947
2948      "ActualAccQual"   string                   The actual memory accesses
2949                                                 performed by the kernel on the
2950                                                 kernel argument. Only present if
2951                                                 "ValueKind" is "GlobalBuffer",
2952                                                 "Image", or "Pipe". This may be
2953                                                 more restrictive than indicated
2954                                                 by "AccQual" to reflect what the
2955                                                 kernel actually does. If not
2956                                                 present then the runtime must
2957                                                 assume what is implied by
2958                                                 "AccQual" and "IsConst". Values
2959                                                 are:
2960
2961                                                 - "ReadOnly"
2962                                                 - "WriteOnly"
2963                                                 - "ReadWrite"
2964
2965      "IsConst"         boolean                  Indicates if the kernel argument
2966                                                 is const qualified. Only present
2967                                                 if "ValueKind" is
2968                                                 "GlobalBuffer".
2969
2970      "IsRestrict"      boolean                  Indicates if the kernel argument
2971                                                 is restrict qualified. Only
2972                                                 present if "ValueKind" is
2973                                                 "GlobalBuffer".
2974
2975      "IsVolatile"      boolean                  Indicates if the kernel argument
2976                                                 is volatile qualified. Only
2977                                                 present if "ValueKind" is
2978                                                 "GlobalBuffer".
2979
2980      "IsPipe"          boolean                  Indicates if the kernel argument
2981                                                 is pipe qualified. Only present
2982                                                 if "ValueKind" is "Pipe".
2983
2984                                                 .. TODO::
2985
2986                                                    Can GlobalBuffer be pipe
2987                                                    qualified?
2988
2989      ================= ============== ========= ================================
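
The "Required?" column and the "Only present if" rules in the table above lend
themselves to a mechanical check. The following Python sketch assumes the
argument metadata has already been loaded into a plain ``dict``;
``check_kernel_arg`` and its messages are hypothetical names used only for
illustration and are not part of LLVM or of any runtime.

.. code-block:: python

  # Keys the table marks as "Required".
  REQUIRED_KEYS = ("Size", "Align", "ValueKind")

  # Optional keys mapped to the "ValueKind" values they may accompany,
  # taken from the "Only present if" clauses in the table.
  CONDITIONAL_KEYS = {
      "PointeeAlign": {"DynamicSharedPointer"},
      "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
      "AccQual": {"Image", "Pipe"},
      "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
      "IsConst": {"GlobalBuffer"},
      "IsRestrict": {"GlobalBuffer"},
      "IsVolatile": {"GlobalBuffer"},
      "IsPipe": {"Pipe"},
  }


  def is_power_of_two(value):
      return isinstance(value, int) and value > 0 and (value & (value - 1)) == 0


  def check_kernel_arg(arg):
      """Return a list of problems found in one argument map (hypothetical helper)."""
      problems = [
          f"missing required key {key!r}" for key in REQUIRED_KEYS if key not in arg
      ]
      # "Align" and "PointeeAlign" must both be powers of two when present.
      for key in ("Align", "PointeeAlign"):
          if key in arg and not is_power_of_two(arg[key]):
              problems.append(f"{key!r} must be a power of two")
      # Flag optional keys that appear with a "ValueKind" the table does not allow.
      kind = arg.get("ValueKind")
      for key, kinds in CONDITIONAL_KEYS.items():
          if key in arg and kind not in kinds:
              problems.append(f"{key!r} is not expected for ValueKind {kind!r}")
      return problems

An empty list means the map satisfies the rules encoded here; the sketch does
not validate value types or the set of permitted "ValueKind" strings.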
2990
2991 ..
2992
2993   .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2994      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2995
2996      ============================ ============== ========= =====================
2997      String Key                   Value Type     Required? Description
2998      ============================ ============== ========= =====================
2999      "KernargSegmentSize"         integer        Required  The size in bytes of
3000                                                            the kernarg segment
3001                                                            that holds the values
3002                                                            of the arguments to
3003                                                            the kernel.
3004      "GroupSegmentFixedSize"      integer        Required  The amount of group
3005                                                            segment memory
3006                                                            required by a
3007                                                            work-group in
3008                                                            bytes. This does not
3009                                                            include any
3010                                                            dynamically allocated
3011                                                            group segment memory
3012                                                            that may be added
3013                                                            when the kernel is
3014                                                            dispatched.
3015      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
3016                                                            private address space
3017                                                            memory required for a
3018                                                            work-item in
3019                                                            bytes. If the kernel
3020                                                            uses a dynamic call
3021                                                            stack then additional
3022                                                            space must be added
3023                                                            to this value for the
3024                                                            call stack.
3025      "KernargSegmentAlign"        integer        Required  The maximum byte
3026                                                            alignment of
3027                                                            arguments in the
3028                                                            kernarg segment. Must
3029                                                            be a power of 2.
3030      "WavefrontSize"              integer        Required  Wavefront size. Must
3031                                                            be a power of 2.
3032      "NumSGPRs"                   integer        Required  Number of scalar
3033                                                            registers used by a
3034                                                            wavefront for
3035                                                            GFX6-GFX11. This
3036                                                            includes the special
3037                                                            SGPRs for VCC, Flat
3038                                                            Scratch (GFX7-GFX10)
3039                                                            and XNACK (for
3040                                                            GFX8-GFX10). It does
3041                                                            not include the 16
3042                                                            SGPRs added if a trap
3043                                                            handler is
3044                                                            enabled. It is not
3045                                                            rounded up to the
3046                                                            allocation
3047                                                            granularity.
3048      "NumVGPRs"                   integer        Required  Number of vector
3049                                                            registers used by
3050                                                            each work-item for
3051                                                            GFX6-GFX11.
3052      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
3053                                                            work-group size
3054                                                            supported by the
3055                                                            kernel in work-items.
3056                                                            Must be >=1 and
3057                                                            consistent with
3058                                                            ReqdWorkGroupSize if
3059                                                            not 0, 0, 0.
3060      "NumSpilledSGPRs"            integer                  Number of stores from
3061                                                            a scalar register to
3062                                                            a register allocator
3063                                                            created spill
3064                                                            location.
3065      "NumSpilledVGPRs"            integer                  Number of stores from
3066                                                            a vector register to
3067                                                            a register allocator
3068                                                            created spill
3069                                                            location.
3070      ============================ ============== ========= =====================
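
The numeric constraints in the table above can be sketched as a small Python
check. This is an illustration only: ``check_code_properties`` is a
hypothetical helper, the properties are assumed to be available as a ``dict``,
and "consistent with ReqdWorkGroupSize" is interpreted here as "the product
X * Y * Z does not exceed MaxFlatWorkGroupSize", which is an assumption rather
than a statement from the table.

.. code-block:: python

  def check_code_properties(props, reqd_work_group_size=(0, 0, 0)):
      """Return a list of problems found in a code properties map (hypothetical)."""
      problems = []
      # Both keys are documented as "Must be a power of 2".
      for key in ("KernargSegmentAlign", "WavefrontSize"):
          value = props.get(key, 0)
          if value <= 0 or value & (value - 1):
              problems.append(f"{key} must be a power of 2")
      max_flat = props.get("MaxFlatWorkGroupSize", 0)
      if max_flat < 1:
          problems.append("MaxFlatWorkGroupSize must be >= 1")
      # Assumed meaning of "consistent": the required flat size must fit.
      x, y, z = reqd_work_group_size
      if (x, y, z) != (0, 0, 0) and x * y * z > max_flat:
          problems.append("ReqdWorkGroupSize exceeds MaxFlatWorkGroupSize")
      return problems

With ``MaxFlatWorkGroupSize`` set to 256, a required work-group size of
16, 16, 2 would be reported because its flat size is 512.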
3071
3072 .. _amdgpu-amdhsa-code-object-metadata-v3:
3073
3074 Code Object V3 Metadata
3075 +++++++++++++++++++++++
3076
3077 .. warning::
3078   Code object V3 is not the default code object version emitted by this version
3079   of LLVM.
3080
3081 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3082 record (see :ref:`amdgpu-note-records-v3-onwards`).
3083
3084 The metadata is represented as Message Pack formatted binary data (see
3085 [MsgPack]_). The top level is a Message Pack map that includes the
3086 keys defined in table
3087 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3088 tables.
3089
3090 Additional information can be added to the maps. To avoid conflicts,
3091 any key names should be prefixed by "*vendor-name*." where
3092 ``vendor-name`` can be the name of the vendor and specific vendor
3093 tool that generates the information. The prefix is abbreviated to
3094 simply "." when it appears within a map that has been added by the
3095 same *vendor-name*.
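
As a rough illustration of the abbreviation rule, the Python sketch below
expands an abbreviated key back to its full form. The helper name
``expand_key`` is hypothetical, and the sketch assumes the caller already
knows which *vendor-name* added the enclosing map.

.. code-block:: python

  def expand_key(key, vendor_name):
      """Expand an abbreviated key such as ".name" to "<vendor_name>.name"."""
      # A leading "." stands for the prefix of the vendor that added the map.
      if key.startswith("."):
          return vendor_name + key
      # Keys that already carry a full prefix are returned unchanged.
      return key

Under this reading, ``expand_key(".name", "amdhsa")`` yields ``"amdhsa.name"``,
while ``"amdhsa.kernels"`` is left as it is.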
3096
3097   .. table:: AMDHSA Code Object V3 Metadata Map
3098      :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3099
3100      ================= ============== ========= =======================================
3101      String Key        Value Type     Required? Description
3102      ================= ============== ========= =======================================
3103      "amdhsa.version"  sequence of    Required  - The first integer is the major
3104                        2 integers                 version. Currently 1.
3105                                                 - The second integer is the minor
3106                                                   version. Currently 0.
3107      "amdhsa.printf"   sequence of              Each string is encoded information
3108                        strings                  about a printf function call. The
3109                                                 encoded information is organized as
3110                                                 fields separated by colon (':'):
3111
3112                                                 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3113
3114                                                 where:
3115
3116                                                 ``ID``
3117                                                   A 32-bit integer as a unique id for
3118                                                   each printf function call
3119
3120                                                 ``N``
3121                                                   A 32-bit integer equal to the number
3122                                                   of arguments of printf function call
3123                                                   minus 1
3124
3125                                                 ``S[i]`` (where i = 0, 1, ... , N-1)
3126                                                   32-bit integers for the size in bytes
3127                                                   of the i-th FormatString argument of
3128                                                   the printf function call
3129
3130                                                 ``FormatString``
3131                                                   The format string passed to the
3132                                                   printf function call.
3133      "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
3134                        map                      kernel in the code object. See
3135                                                 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3136                                                 for the definition of the keys included
3137                                                 in that map.
3138      ================= ============== ========= =======================================
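
The "amdhsa.printf" encoding described above can be decoded with a few string
splits. The Python sketch below is a minimal illustration:
``parse_printf_entry`` is a hypothetical helper, the integer fields are assumed
to be written in decimal, and any escaping that a producer may apply to the
format string is not undone.

.. code-block:: python

  def parse_printf_entry(entry):
      """Split "ID:N:S[0]:...:S[N-1]:FormatString" into its parts."""
      # Peel off ID and N first; everything else stays in ``rest``.
      printf_id, count, rest = entry.split(":", 2)
      n = int(count)
      # Split at most N more times so that colons inside the format
      # string are preserved in the final field.
      fields = rest.split(":", n)
      sizes = [int(field) for field in fields[:n]]
      format_string = fields[n]
      return int(printf_id), sizes, format_string

Limiting the number of splits is the key design choice: the format string is
the last field, so it is taken as whatever remains after the ``N`` size fields.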
3139
3140 ..
3141
3142   .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3143      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3144
3145      =================================== ============== ========= ================================
3146      String Key                          Value Type     Required? Description
3147      =================================== ============== ========= ================================
3148      ".name"                             string         Required  Source name of the kernel.
3149      ".symbol"                           string         Required  Name of the kernel
3150                                                                   descriptor ELF symbol.
3151      ".language"                         string                   Source language of the kernel.
3152                                                                   Values include:
3153
3154                                                                   - "OpenCL C"
3155                                                                   - "OpenCL C++"
3156                                                                   - "HCC"
3157                                                                   - "HIP"
3158                                                                   - "OpenMP"
3159                                                                   - "Assembler"
3160
3161      ".language_version"                 sequence of              - The first integer is the major
3162                                          2 integers                 version.
3163                                                                   - The second integer is the
3164                                                                     minor version.
3165      ".args"                             sequence of              Sequence of maps of the
3166                                          map                      kernel arguments. See
3167                                                                   :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3168                                                                   for the definition of the keys
3169                                                                   included in that map.
3170      ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3171                                          3 integers               must be >=1 and the dispatch
3172                                                                   work-group size X, Y, Z must
3173                                                                   correspond to the specified
3174                                                                   values. Defaults to 0, 0, 0.
3175
3176                                                                   Corresponds to the OpenCL
3177                                                                   ``reqd_work_group_size``
3178                                                                   attribute.
3179      ".workgroup_size_hint"              sequence of              The dispatch work-group size
3180                                          3 integers               X, Y, Z is likely to be the
3181                                                                   specified values.
3182
3183                                                                   Corresponds to the OpenCL
3184                                                                   ``work_group_size_hint``
3185                                                                   attribute.
3186      ".vec_type_hint"                    string                   The name of a scalar or vector
3187                                                                   type.
3188
3189                                                                   Corresponds to the OpenCL
3190                                                                   ``vec_type_hint`` attribute.
3191
3192      ".device_enqueue_symbol"            string                   The external symbol name
3193                                                                   associated with a kernel.
3194                                                                   The OpenCL runtime allocates a
3195                                                                   global buffer for the symbol
3196                                                                   and saves the kernel's address
3197                                                                   to it, which is used for
3198                                                                   device side enqueueing. Only
3199                                                                   available for device side
3200                                                                   enqueued kernels.
3201      ".kernarg_segment_size"             integer        Required  The size in bytes of
3202                                                                   the kernarg segment
3203                                                                   that holds the values
3204                                                                   of the arguments to
3205                                                                   the kernel.
3206      ".group_segment_fixed_size"         integer        Required  The amount of group
3207                                                                   segment memory
3208                                                                   required by a
3209                                                                   work-group in
3210                                                                   bytes. This does not
3211                                                                   include any
3212                                                                   dynamically allocated
3213                                                                   group segment memory
3214                                                                   that may be added
3215                                                                   when the kernel is
3216                                                                   dispatched.
3217      ".private_segment_fixed_size"       integer        Required  The amount of fixed
3218                                                                   private address space
3219                                                                   memory required for a
3220                                                                   work-item in
3221                                                                   bytes. If the kernel
3222                                                                   uses a dynamic call
3223                                                                   stack then additional
3224                                                                   space must be added
3225                                                                   to this value for the
3226                                                                   call stack.
3227      ".kernarg_segment_align"            integer        Required  The maximum byte
3228                                                                   alignment of
3229                                                                   arguments in the
3230                                                                   kernarg segment. Must
3231                                                                   be a power of 2.
3232      ".wavefront_size"                   integer        Required  Wavefront size. Must
3233                                                                   be a power of 2.
3234      ".sgpr_count"                       integer        Required  Number of scalar
3235                                                                   registers required by a
3236                                                                   wavefront for
3237                                                                   GFX6-GFX9. A register
3238                                                                   is required if it is
3239                                                                   used explicitly, or
3240                                                                   if a higher numbered
3241                                                                   register is used
3242                                                                   explicitly. This
3243                                                                   includes the special
3244                                                                   SGPRs for VCC, Flat
3245                                                                   Scratch (GFX7-GFX9)
3246                                                                   and XNACK (for
3247                                                                   GFX8-GFX9). It does
3248                                                                   not include the 16
3249                                                                   SGPR added if a trap
3250                                                                   handler is
3251                                                                   enabled. It is not
3252                                                                   rounded up to the
3253                                                                   allocation
3254                                                                   granularity.
3255      ".vgpr_count"                       integer        Required  Number of vector
3256                                                                   registers required by
3257                                                                   each work-item for
3258                                                                   GFX6-GFX9. A register
3259                                                                   is required if it is
3260                                                                   used explicitly, or
3261                                                                   if a higher numbered
3262                                                                   register is used
3263                                                                   explicitly.
3264      ".agpr_count"                       integer        Required  Number of accumulator
3265                                                                   registers required by
3266                                                                   each work-item for
3267                                                                   GFX90A, GFX908.
3268      ".max_flat_workgroup_size"          integer        Required  Maximum flat
3269                                                                   work-group size
3270                                                                   supported by the
3271                                                                   kernel in work-items.
3272                                                                   Must be >=1 and
3273                                                                   consistent with
3274                                                                   ReqdWorkGroupSize if
3275                                                                   not 0, 0, 0.
3276      ".sgpr_spill_count"                 integer                  Number of stores from
3277                                                                   a scalar register to
3278                                                                   a register allocator
3279                                                                   created spill
3280                                                                   location.
3281      ".vgpr_spill_count"                 integer                  Number of stores from
3282                                                                   a vector register to
3283                                                                   a register allocator
3284                                                                   created spill
3285                                                                   location.
3286      ".kind"                             string                   The kind of the kernel
3287                                                                   with the following
3288                                                                   values:
3289
3290                                                                   "normal"
3291                                                                     Regular kernels.
3292
3293                                                                   "init"
3294                                                                     These kernels must be
3295                                                                     invoked after loading
3296                                                                     the containing code
3297                                                                     object and must
3298                                                                     complete before any
3299                                                                     normal and fini
3300                                                                     kernels in the same
3301                                                                     code object are
3302                                                                     invoked.
3303
3304                                                                   "fini"
3305                                                                     These kernels must be
3306                                                                     invoked before
3307                                                                     unloading the
3308                                                                     containing code object
3309                                                                     and after all init and
3310                                                                     normal kernels in the
3311                                                                     same code object have
3312                                                                     been invoked and
3313                                                                     completed.
3314
3315                                                                   If omitted, "normal" is
3316                                                                   assumed.
3317      =================================== ============== ========= ================================
3318
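As an informal illustration of the constraints in the kernel metadata map above, the following sketch checks a few of them on an already decoded metadata map. It assumes the metadata has been decoded into a plain Python ``dict``; the function names are hypothetical and are not part of LLVM or any runtime.

```python
VALID_KINDS = {"normal", "init", "fini"}


def is_power_of_two(n):
    """True if n is a positive power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def check_kernel_metadata(kernel):
    """Return a list of problems found in one kernel metadata map (hypothetical helper)."""
    problems = []
    # Both keys are listed as Required and must be powers of 2.
    for key in (".kernarg_segment_align", ".wavefront_size"):
        if key not in kernel:
            problems.append("missing required key " + key)
        elif not is_power_of_two(kernel[key]):
            problems.append(key + " must be a power of 2")
    # ".max_flat_workgroup_size" is Required and must be >= 1.
    if ".max_flat_workgroup_size" not in kernel:
        problems.append("missing required key .max_flat_workgroup_size")
    elif kernel[".max_flat_workgroup_size"] < 1:
        problems.append(".max_flat_workgroup_size must be >= 1")
    # If ".kind" is omitted, "normal" is assumed.
    kind = kernel.get(".kind", "normal")
    if kind not in VALID_KINDS:
        problems.append("unknown .kind " + repr(kind))
    return problems
```

Only the keys whose constraints are stated in the table are checked; a real consumer would also need to handle the remaining required keys.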
3319 ..
3320
3321   .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3322      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3323
3324      ====================== ============== ========= ================================
3325      String Key             Value Type     Required? Description
3326      ====================== ============== ========= ================================
3327      ".name"                string                   Kernel argument name.
3328      ".type_name"           string                   Kernel argument type name.
3329      ".size"                integer        Required  Kernel argument size in bytes.
3330      ".offset"              integer        Required  Kernel argument offset in
3331                                                      bytes. The offset must be a
3332                                                      multiple of the alignment
3333                                                      required by the argument.
3334      ".value_kind"          string         Required  Kernel argument kind that
3335                                                      specifies how to set up the
3336                                                      corresponding argument.
3337                                                      Values include:
3338
3339                                                      "by_value"
3340                                                        The argument is copied
3341                                                        directly into the kernarg.
3342
3343                                                      "global_buffer"
3344                                                        A global address space pointer
3345                                                        to the buffer data is passed
3346                                                        in the kernarg.
3347
3348                                                      "dynamic_shared_pointer"
3349                                                        A group address space pointer
3350                                                        to dynamically allocated LDS
3351                                                        is passed in the kernarg.
3352
3353                                                      "sampler"
3354                                                        A global address space
3355                                                        pointer to a S# is passed in
3356                                                        the kernarg.
3357
3358                                                      "image"
3359                                                        A global address space
3360                                                        pointer to a T# is passed in
3361                                                        the kernarg.
3362
3363                                                      "pipe"
3364                                                        A global address space pointer
3365                                                        to an OpenCL pipe is passed in
3366                                                        the kernarg.
3367
3368                                                      "queue"
3369                                                        A global address space pointer
3370                                                        to an OpenCL device enqueue
3371                                                        queue is passed in the
3372                                                        kernarg.
3373
3374                                                      "hidden_global_offset_x"
3375                                                        The OpenCL grid dispatch
3376                                                        global offset for the X
3377                                                        dimension is passed in the
3378                                                        kernarg.
3379
3380                                                      "hidden_global_offset_y"
3381                                                        The OpenCL grid dispatch
3382                                                        global offset for the Y
3383                                                        dimension is passed in the
3384                                                        kernarg.
3385
3386                                                      "hidden_global_offset_z"
3387                                                        The OpenCL grid dispatch
3388                                                        global offset for the Z
3389                                                        dimension is passed in the
3390                                                        kernarg.
3391
3392                                                      "hidden_none"
3393                                                        An argument that is not used
3394                                                        by the kernel. Space needs to
3395                                                        be left for it, but it does
3396                                                        not need to be set up.
3397
3398                                                      "hidden_printf_buffer"
3399                                                        is passed in the kernarg. Mutually
3400                                                        to the runtime printf buffer
3401                                                        is passed in kernarg. Mutually
3402                                                        exclusive with
3403                                                        "hidden_hostcall_buffer"
3404                                                        before Code Object V5.
3405
3406                                                      "hidden_hostcall_buffer"
3407                                                        is passed in the kernarg. Mutually
3408                                                        to the runtime hostcall buffer
3409                                                        is passed in kernarg. Mutually
3410                                                        exclusive with
3411                                                        "hidden_printf_buffer"
3412                                                        before Code Object V5.
3413
3414                                                      "hidden_default_queue"
3415                                                        A global address space pointer
3416                                                        to the OpenCL device enqueue
3417                                                        queue that should be used by
3418                                                        the kernel by default is
3419                                                        passed in the kernarg.
3420
3421                                                      "hidden_completion_action"
3422                                                        A global address space pointer
3423                                                        to help link enqueued kernels into
3424                                                        the ancestor tree for determining
3425                                                        when the parent kernel has finished.
3426
3427                                                      "hidden_multigrid_sync_arg"
3428                                                        A global address space pointer for
3429                                                        multi-grid synchronization is
3430                                                        passed in the kernarg.
3431
3432      ".value_type"          string                   Unused and deprecated. This should no longer
3433                                                      be emitted, but is accepted for compatibility.
3434
3435      ".pointee_align"       integer                  Alignment in bytes of pointee
3436                                                      type for pointer type kernel
3437                                                      argument. Must be a power
3438                                                      of 2. Only present if
3439                                                      ".value_kind" is
3440                                                      "dynamic_shared_pointer".
3441      ".address_space"       string                   Kernel argument address space
3442                                                      qualifier. Only present if
3443                                                      ".value_kind" is "global_buffer" or
3444                                                      "dynamic_shared_pointer". Values
3445                                                      are:
3446
3447                                                      - "private"
3448                                                      - "global"
3449                                                      - "constant"
3450                                                      - "local"
3451                                                      - "generic"
3452                                                      - "region"
3453
3454                                                      .. TODO::
3455
3456                                                         Is "global_buffer" only "global"
3457                                                         or "constant"? Is
3458                                                         "dynamic_shared_pointer" always
3459                                                         "local"? Can HCC allow "generic"?
3460                                                         How can "private" or "region"
3461                                                         ever happen?
3462
3463      ".access"              string                   Kernel argument access
3464                                                      qualifier. Only present if
3465                                                      ".value_kind" is "image" or
3466                                                      "pipe". Values
3467                                                      are:
3468
3469                                                      - "read_only"
3470                                                      - "write_only"
3471                                                      - "read_write"
3472
3473                                                      .. TODO::
3474
3475                                                         Does this apply to
3476                                                         "global_buffer"?
3477
3478      ".actual_access"       string                   The actual memory accesses
3479                                                      performed by the kernel on the
3480                                                      kernel argument. Only present if
3481                                                      ".value_kind" is "global_buffer",
3482                                                      "image", or "pipe". This may be
3483                                                      more restrictive than indicated
3484                                                      by ".access" to reflect what the
3485                                                      kernel actually does. If not
3486                                                      present then the runtime must
3487                                                      assume what is implied by
3488                                                      ".access" and ".is_const". Values
3489                                                      are:
3490
3491                                                      - "read_only"
3492                                                      - "write_only"
3493                                                      - "read_write"
3494
3495      ".is_const"            boolean                  Indicates if the kernel argument
3496                                                      is const qualified. Only present
3497                                                      if ".value_kind" is
3498                                                      "global_buffer".
3499
3500      ".is_restrict"         boolean                  Indicates if the kernel argument
3501                                                      is restrict qualified. Only
3502                                                      present if ".value_kind" is
3503                                                      "global_buffer".
3504
3505      ".is_volatile"         boolean                  Indicates if the kernel argument
3506                                                      is volatile qualified. Only
3507                                                      present if ".value_kind" is
3508                                                      "global_buffer".
3509
3510      ".is_pipe"             boolean                  Indicates if the kernel argument
3511                                                      is pipe qualified. Only present
3512                                                      if ".value_kind" is "pipe".
3513
3514                                                      .. TODO::
3515
3516                                                         Can "global_buffer" be pipe
3517                                                         qualified?
3518
3519      ====================== ============== ========= ================================
3520
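Several keys in the kernel argument metadata map above are described as only present for particular ``.value_kind`` values. The sketch below encodes those rules as a lookup table and reports keys that appear with an unexpected kind. It assumes each argument map has already been decoded into a Python ``dict``; ``check_kernel_argument`` is a hypothetical helper name.

```python
# Keys the table describes as "only present" for specific ".value_kind" values.
KIND_RESTRICTED_KEYS = {
    ".pointee_align": {"dynamic_shared_pointer"},
    ".address_space": {"global_buffer", "dynamic_shared_pointer"},
    ".access": {"image", "pipe"},
    ".actual_access": {"global_buffer", "image", "pipe"},
    ".is_const": {"global_buffer"},
    ".is_restrict": {"global_buffer"},
    ".is_volatile": {"global_buffer"},
    ".is_pipe": {"pipe"},
}

# Keys marked Required in the table.
REQUIRED_KEYS = (".size", ".offset", ".value_kind")


def check_kernel_argument(arg):
    """Return a list of problems found in one kernel argument map."""
    problems = ["missing required key " + k for k in REQUIRED_KEYS if k not in arg]
    kind = arg.get(".value_kind")
    for key, allowed_kinds in KIND_RESTRICTED_KEYS.items():
        if key in arg and kind not in allowed_kinds:
            problems.append(key + " is not expected for value kind " + repr(kind))
    return problems
```

The sketch does not check that ``.offset`` is a multiple of the argument's alignment, because the alignment required by an argument is not itself recorded in this map.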
3521 .. _amdgpu-amdhsa-code-object-metadata-v4:
3522
3523 Code Object V4 Metadata
3524 +++++++++++++++++++++++
3525
3526 Code object V4 metadata is the same as
3527 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3528 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3529
3530   .. table:: AMDHSA Code Object V4 Metadata Map Changes
3531      :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3532
3533      ================= ============== ========= =======================================
3534      String Key        Value Type     Required? Description
3535      ================= ============== ========= =======================================
3536      "amdhsa.version"  sequence of    Required  - The first integer is the major
3537                        2 integers                 version. Currently 1.
3538                                                 - The second integer is the minor
3539                                                   version. Currently 1.
3540      "amdhsa.target"   string         Required  The target name of the code using the syntax:
3541
3542                                                 .. code::
3543
3544                                                   <target-triple> [ "-" <target-id> ]
3545
3546                                                 A canonical target ID must be
3547                                                 used. See :ref:`amdgpu-target-triples`
3548                                                 and :ref:`amdgpu-target-id`.
3549      ================= ============== ========= =======================================
3550
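The ``amdhsa.target`` syntax above can be split back into its two parts with a short sketch like the following. It assumes the target triple always has the four-component ``<Architecture>-<Vendor>-<OS>-<Environment>`` form (with a possibly empty environment); the function name and the example string are illustrative assumptions, not values taken from a specific code object.

```python
def split_amdhsa_target(target):
    """Split an "amdhsa.target" string into (target_triple, target_id).

    Assumes a four-component triple, so at most the first four "-" characters
    are separators. Any later "-" (for example in a feature setting such as
    "xnack-") stays inside the target ID.
    """
    parts = target.split("-", 4)
    if len(parts) < 4:
        raise ValueError("expected a four-component target triple: " + repr(target))
    triple = "-".join(parts[:4])
    target_id = parts[4] if len(parts) == 5 else None
    return triple, target_id


# Illustrative input only; the environment component is empty here.
triple, target_id = split_amdhsa_target("amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-")
```

Limiting the split to four separators is what keeps a trailing ``-`` in the target ID from being treated as another delimiter.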
3551 .. _amdgpu-amdhsa-code-object-metadata-v5:
3552
3553 Code Object V5 Metadata
3554 +++++++++++++++++++++++
3555
3556 .. warning::
3557   Code object V5 is not the default code object version emitted by this version
3558   of LLVM.
3559
3560
3561 Code object V5 metadata is the same as
3562 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3563 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
3564 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
3565 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3566
3567   .. table:: AMDHSA Code Object V5 Metadata Map Changes
3568      :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3569
3570      ================= ============== ========= =======================================
3571      String Key        Value Type     Required? Description
3572      ================= ============== ========= =======================================
3573      "amdhsa.version"  sequence of    Required  - The first integer is the major
3574                        2 integers                 version. Currently 1.
3575                                                 - The second integer is the minor
3576                                                   version. Currently 2.
3577      ================= ============== ========= =======================================
3578
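A consumer can use ``amdhsa.version`` to tell the V4 and V5 metadata layouts apart. The sketch below covers only the two version pairs stated in the V4 and V5 tables of this section; the function name is hypothetical and the metadata is assumed to be decoded into a Python ``dict``.

```python
# (major, minor) pairs taken from the V4 and V5 "amdhsa.version" rows.
METADATA_VERSION_TO_CODE_OBJECT = {
    (1, 1): "V4",
    (1, 2): "V5",
}


def code_object_version(metadata):
    """Return "V4" or "V5", or None for a version pair not covered by this sketch."""
    version = tuple(metadata["amdhsa.version"])
    return METADATA_VERSION_TO_CODE_OBJECT.get(version)
```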
3579 ..
3580
3581   .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
3582      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
3583
3584      ============================= ============= ========== =======================================
3585      String Key                    Value Type     Required? Description
3586      ============================= ============= ========== =======================================
3587      ".uses_dynamic_stack"         boolean                  Indicates if the generated machine code
3588                                                             is using a dynamically sized stack.
3589      ".workgroup_processor_mode"   boolean                  (GFX10+) Controls ENABLE_WGP_MODE in
3590                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3591      ============================= ============= ========== =======================================
3592
3593 ..
3594
3595   .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
3596      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
3597
3598      =========================== ============== ========= ==============================
3599      String Key                  Value Type     Required? Description
3600      =========================== ============== ========= ==============================
3601      ".uniform_work_group_size"  integer                  Indicates if the kernel
3602                                                           requires that each dimension
3603                                                           of global size is a multiple
3604                                                           of corresponding dimension of
3605                                                           work-group size. Value of 1
3606                                                           implies true and value of 0
3607                                                           implies false. Metadata is
3608                                                           only emitted when value is 1.
3609      =========================== ============== ========= ==============================
3610
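The condition behind ``.uniform_work_group_size`` can be illustrated with a little arithmetic: a dispatch is uniform when every grid dimension is a multiple of the corresponding work-group dimension, so no partial work-group exists. The sketch below is a hedged illustration using hypothetical helper names; sizes are given in work-items per dimension.

```python
def dispatch_layout(grid_size, group_size):
    """Return, per dimension, (non-partial work-group count, partial work-group size).

    The partial work-group size is 0 when the grid size in that dimension is a
    multiple of the work-group size.
    """
    return [(grid // group, grid % group) for grid, group in zip(grid_size, group_size)]


def is_uniform(grid_size, group_size):
    """True if each grid dimension is a multiple of the work-group dimension."""
    return all(remainder == 0 for _, remainder in dispatch_layout(grid_size, group_size))


# A 100 x 64 x 1 grid with 32 x 16 x 1 work-groups has a partial work-group in X.
layout = dispatch_layout((100, 64, 1), (32, 16, 1))
```

A kernel whose metadata sets ``.uniform_work_group_size`` to 1 requires dispatches for which ``is_uniform`` would return true.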
3611 ..
3612
3613 ..
3614
3615   .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3616      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3617
3618      ====================== ============== ========= ================================
3619      String Key             Value Type     Required? Description
3620      ====================== ============== ========= ================================
3621      ".value_kind"          string         Required  Kernel argument kind that
3622                                                      specifies how to set up the
3623                                                      corresponding argument.
3624                                                      Values are the same as in
3625                                                      code object V3 metadata
3626                                                      (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
3627                                                      with the following additions:
3628
3629                                                      "hidden_block_count_x"
3630                                                        The grid dispatch work-group count for the X dimension
3631                                                        is passed in the kernarg. Some languages, such as OpenCL,
3632                                                        support a last work-group in each dimension being partial.
3633                                                        This count only includes the non-partial work-group count.
3634                                                        This is not the same as the value in the AQL dispatch packet,
3635                                                        which has the grid size in work-items.
3636
3637                                                      "hidden_block_count_y"
3638                                                        The grid dispatch work-group count for the Y dimension
3639                                                        is passed in the kernarg. Some languages, such as OpenCL,
3640                                                        support a last work-group in each dimension being partial.
3641                                                        This count only includes the non-partial work-group count.
3642                                                        This is not the same as the value in the AQL dispatch packet,
3643                                                        which has the grid size in work-items. If the grid dimensionality
3644                                                        is 1, then this must be 1.
3645
3646                                                      "hidden_block_count_z"
3647                                                        The grid dispatch work-group count for the Z dimension
3648                                                        is passed in the kernarg. Some languages, such as OpenCL,
3649                                                        support a last work-group in each dimension being partial.
3650                                                        This count only includes the non-partial work-group count.
3651                                                        This is not the same as the value in the AQL dispatch packet,
3652                                                        which has the grid size in work-items. If the grid dimensionality
3653                                                        is 1 or 2, then this must be 1.
3654
3655                                                      "hidden_group_size_x"
3656                                                        The grid dispatch work-group size for the X dimension is
3657                                                        passed in the kernarg. This size only applies to the
3658                                                        non-partial work-groups. This is the same value as the AQL
3659                                                        dispatch packet work-group size.
3660
3661                                                      "hidden_group_size_y"
3662                                                        The grid dispatch work-group size for the Y dimension is
3663                                                        passed in the kernarg. This size only applies to the
3664                                                        non-partial work-groups. This is the same value as the AQL
3665                                                        dispatch packet work-group size. If the grid dimensionality
3666                                                        is 1, then this must be 1.
3667
3668                                                      "hidden_group_size_z"
3669                                                        The grid dispatch work-group size for the Z dimension is
3670                                                        passed in the kernarg. This size only applies to the
3671                                                        non-partial work-groups. This is the same value as the AQL
3672                                                        dispatch packet work-group size. If the grid dimensionality
3673                                                        is 1 or 2, then this must be 1.
3674
3675                                                      "hidden_remainder_x"
3676                                                        The grid dispatch work-group size of the partial work-group
3677                                                        of the X dimension, if it exists. Must be zero if a partial
3678                                                        work-group does not exist in the X dimension.
3679
3680                                                      "hidden_remainder_y"
3681                                                        The grid dispatch work-group size of the partial work-group
3682                                                        of the Y dimension, if it exists. Must be zero if a partial
3683                                                        work-group does not exist in the Y dimension.
3684
3685                                                      "hidden_remainder_z"
3686                                                        The grid dispatch work-group size of the partial work-group
3687                                                        of the Z dimension, if it exists. Must be zero if a partial
3688                                                        work-group does not exist in the Z dimension.
3689
3690                                                      "hidden_grid_dims"
3691                                                        The grid dispatch dimensionality. This is the same value
3692                                                        as the AQL dispatch packet dimensionality. Must be a value
3693                                                        between 1 and 3.
3694
3695                                                      "hidden_heap_v1"
3696                                                        A global address space pointer to an initialized memory
3697                                                        buffer that conforms to the requirements of the malloc/free
3698                                                        device library V1 version implementation.
3699
3700                                                      "hidden_private_base"
3701                                                        The high 32 bits of the flat addressing private aperture base.
3702                                                        Only used by GFX8 to allow conversion between private segment
3703                                                        and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3704
3705                                                      "hidden_shared_base"
3706                                                        The high 32 bits of the flat addressing shared aperture base.
3707                                                        Only used by GFX8 to allow conversion between shared segment
3708                                                        and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3709
3710                                                      "hidden_queue_ptr"
3711                                                        A global memory address space pointer to the ROCm runtime
3712                                                        ``struct amd_queue_t`` structure for the HSA queue of the
3713                                                        associated dispatch AQL packet. It is only required for pre-GFX9
3714                                                        devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
3715
3716      ====================== ============== ========= ================================
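
As a rough illustration of how the hidden size arguments in the table above
relate to one another, the following Python sketch derives them from a grid
size and a work-group size. The function name ``hidden_dispatch_args`` and its
parameters are hypothetical and only mirror the table's descriptions; this is
not how any particular runtime computes them.

.. code-block:: python

   def hidden_dispatch_args(grid_size, group_size):
       """Derive hidden size arguments for a 1-, 2- or 3-dimensional dispatch.

       grid_size and group_size are sequences of 1 to 3 positive integers
       (X first), both counted in work-items. The names are illustrative.
       """
       dims = len(grid_size)
       if not 1 <= dims <= 3 or len(group_size) != dims:
           raise ValueError("expected 1 to 3 matching dimensions")
       # Dimensions beyond the grid dimensionality have a size of 1.
       grid = list(grid_size) + [1] * (3 - dims)
       group = list(group_size) + [1] * (3 - dims)
       args = {"hidden_grid_dims": dims}
       for axis, g, wg in zip("xyz", grid, group):
           args["hidden_group_size_" + axis] = wg
           # Size of the partial work-group, or 0 when the grid divides evenly.
           args["hidden_remainder_" + axis] = g % wg
       return args

For example, a 1000 x 4 grid with 256 x 4 work-groups yields a
``hidden_remainder_x`` of 232, a ``hidden_remainder_y`` of 0 and a
``hidden_group_size_z`` of 1.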
3717
3718 ..
3719
3720 Kernel Dispatch
3721 ~~~~~~~~~~~~~~~
3722
3723 The HSA architected queuing language (AQL) defines a user space memory interface
3724 that can be used to control the dispatch of kernels, in an agent independent
3725 way. An agent can have zero or more AQL queues created for it using an HSA
3726 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3727 are 64 bytes) can be placed. See the *HSA Platform System Architecture
3728 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3729
3730 The packet processor of a kernel agent is responsible for detecting and
3731 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3732 packet processor is implemented by the hardware command processor (CP),
3733 asynchronous dispatch controller (ADC) and shader processor input controller
3734 (SPI).
3735
3736 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3737 the kernel mode driver to initialize and register the AQL queue with CP.
3738
3739 To dispatch a kernel the following actions are performed. This can occur in the
3740 CPU host program, or from an HSA kernel executing on a GPU.
3741
3742 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3743    executed is obtained.
3744 2. A pointer to the kernel descriptor (see
3745    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3746    It must be for a kernel that is contained in a code object that was loaded
3747    by an HSA compatible runtime on the kernel agent with which the AQL queue is
3748    associated.
3749 3. Space is allocated for the kernel arguments using the HSA compatible runtime
3750    allocator for a memory region with the kernarg property for the kernel agent
3751    that will execute the kernel. It must be at least 16-byte aligned.
3752 4. Kernel argument values are assigned to the kernel argument memory
3753    allocation. The layout is defined in the *HSA Programmer's Language
3754    Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3755    kernel argument memory in the same way constant memory is accessed. (Note
3756    that the HSA specification allows an implementation to copy the kernel
3757    argument contents to another location that is accessed by the kernel.)
3758 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3759    runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3760    for the packet. The packet must be set up, and the final write must use an
3761    atomic store release to set the packet kind to ensure the packet contents are
3762    visible to the kernel agent. AQL defines a doorbell signal mechanism to
3763    notify the kernel agent that the AQL queue has been updated. These rules, and
3764    the layout of the AQL queue and kernel dispatch packet, are defined in the
3765    *HSA System Architecture Specification* [HSA]_.
3766 6. A kernel dispatch packet includes information about the actual dispatch,
3767    such as grid and work-group size, together with information from the code
3768    object about the kernel, such as segment sizes. The HSA compatible runtime
3769    queries on the kernel symbol can be used to obtain the code object values
3770    which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
3771 7. CP executes micro-code and is responsible for detecting and setting up the
3772    GPU to execute the wavefronts of a kernel dispatch.
3773 8. CP ensures that when a wavefront starts executing the kernel machine
3774    code, the scalar general purpose registers (SGPR) and vector general purpose
3775    registers (VGPR) are set up as required by the machine code. The required
3776    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3777    register state is defined in
3778    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
3779 9. The prolog of the kernel machine code (see
3780    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3781    before continuing executing the machine code that corresponds to the kernel.
3782 10. When the kernel dispatch has completed execution, CP signals the completion
3783     signal specified in the kernel dispatch packet if not 0.
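
Steps 3 to 6 above can be sketched in Python as follows. The field order and
the packet type encodings are assumptions based on the
``hsa_kernel_dispatch_packet_t`` layout of the HSA specification and should be
checked against [HSA]_; the helper names ``build_dispatch_packet`` and
``publish`` are hypothetical. Python has no atomic store release, so
``publish`` only models the requirement that the header is written last.

.. code-block:: python

   import struct

   # Assumed layout of the 64-byte AQL kernel dispatch packet (little-endian):
   # header, setup, workgroup_size_x/y/z, reserved, grid_size_x/y/z,
   # private_segment_size, group_segment_size, kernel_object,
   # kernarg_address, reserved, completion_signal.
   AQL_DISPATCH = struct.Struct("<6H5I4Q")

   PACKET_TYPE_INVALID = 1           # assumed HSA packet type encodings
   PACKET_TYPE_KERNEL_DISPATCH = 2


   def build_dispatch_packet(grid, workgroup, private_size, group_size,
                             kernel_object, kernarg_address,
                             completion_signal=0):
       """Return a 64-byte packet whose header still marks it as invalid."""
       dims = len(grid)
       grid = list(grid) + [1] * (3 - dims)
       workgroup = list(workgroup) + [1] * (3 - dims)
       if kernarg_address % 16:
           raise ValueError("kernarg memory must be at least 16-byte aligned")
       return bytearray(AQL_DISPATCH.pack(
           PACKET_TYPE_INVALID, dims, *workgroup, 0, *grid,
           private_size, group_size, kernel_object, kernarg_address,
           0, completion_signal))


   def publish(packet, header):
       """Write the 16-bit header last.

       A real producer must use an atomic store release here; a plain
       Python write only models the ordering of the stores.
       """
       struct.pack_into("<H", packet, 0, header)
       return packet

The addresses passed for ``kernel_object`` and ``kernarg_address`` would come
from the HSA compatible runtime; ringing the doorbell signal is not modeled.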
3784
3785 .. _amdgpu-amdhsa-memory-spaces:
3786
3787 Memory Spaces
3788 ~~~~~~~~~~~~~
3789
3790 The memory space properties are:
3791
3792   .. table:: AMDHSA Memory Spaces
3793      :name: amdgpu-amdhsa-memory-spaces-table
3794
3795      ================= =========== ======== ======= ==================
3796      Memory Space Name HSA Segment Hardware Address NULL Value
3797                        Name        Name     Size
3798      ================= =========== ======== ======= ==================
3799      Private           private     scratch  32      0x00000000
3800      Local             group       LDS      32      0xFFFFFFFF
3801      Global            global      global   64      0x0000000000000000
3802      Constant          constant    *same as 64      0x0000000000000000
3803                                    global*
3804      Generic           flat        flat     64      0x0000000000000000
3805      Region            N/A         GDS      32      *not implemented
3806                                                     for AMDHSA*
3807      ================= =========== ======== ======= ==================
3808
3809 The global and constant memory spaces both use global virtual addresses, which
3810 are the same virtual address space used by the CPU. However, some virtual
3811 addresses may only be accessible to the CPU, some only accessible by the GPU,
3812 and some by both.
3813
3814 Using the constant memory space indicates that the data will not change during
3815 the execution of the kernel. This allows scalar read instructions to be
3816 used. The vector and scalar L1 caches are invalidated of volatile data before
3817 each kernel dispatch execution to allow constant memory to change values between
3818 kernel dispatches.
3819
3820 The local memory space uses the hardware Local Data Store (LDS) which is
3821 automatically allocated when the hardware creates work-groups of wavefronts, and
3822 freed when all the wavefronts of a work-group have terminated. The data store
3823 (DS) instructions can be used to access it.
3824
3825 The private memory space uses the hardware scratch memory support. If the kernel
3826 uses scratch, then the hardware allocates memory that is accessed using
3827 wavefront lane dword (4 byte) interleaving. The mapping used from private
3828 address to physical address is:
3829
3830   ``wavefront-scratch-base +
3831   (private-address * wavefront-size * 4) +
3832   (wavefront-lane-id * 4)``
3833
3834 There are different ways that the wavefront scratch base address is determined
3835 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3836 memory can be accessed in an interleaved manner using buffer instructions with
3837 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3838 instructions, or by flat instructions. If each lane of a wavefront accesses the
3839 same private address, the interleaving results in adjacent dwords being accessed
3840 and hence requires fewer cache lines to be fetched. Multi-dword access is not
3841 supported except by flat and scratch instructions in GFX9-GFX11.
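
The interleaving can be made concrete with a small Python sketch that applies
the mapping quoted above term for term. The function name and parameters are
hypothetical, and the formula is taken literally as written, including the
unit it assumes for the private address.

.. code-block:: python

   def scratch_physical_address(wavefront_scratch_base, private_address,
                                lane_id, wavefront_size=64):
       """Apply the scratch interleaving formula, term for term."""
       if not 0 <= lane_id < wavefront_size:
           raise ValueError("lane_id out of range")
       return (wavefront_scratch_base
               + private_address * wavefront_size * 4
               + lane_id * 4)

With the same private address, lanes 0 and 1 map to addresses that differ by
4 bytes, which is why a wavefront-wide access to one private address touches
adjacent dwords.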
3842
3843 The generic address space uses the hardware flat address support available in
3844 GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
3845 local apertures), that are outside the range of addressable global memory, to
3846 map from a flat address to a private or local address.
3847
3848 FLAT instructions can take a flat address and access global, private (scratch)
3849 and group (LDS) memory depending on whether the address is within one of the
3850 aperture ranges. Flat access to scratch requires hardware aperture setup and
3851 setup in the kernel prologue (see
3852 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3853 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3854 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3855
3856 To convert between a segment address and a flat address, the base addresses of
3857 the apertures can be used. For GFX7-GFX8 these are available in the
3858 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
3859 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3860 GFX9-GFX11 the aperture base addresses are directly available as inline constant
3861 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3862 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3863 which makes it easier to convert from flat to segment or segment to flat.
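
Because the apertures are 2^32 bytes in size and aligned to 2^32, the
conversion reduces to combining or masking the low 32 bits. The Python sketch
below illustrates this under that assumption only; the function names are
hypothetical, the aperture base is supplied by the caller, and the NULL values
of the memory spaces table need separate handling that is omitted here.

.. code-block:: python

   APERTURE_SIZE = 1 << 32  # aperture size and base alignment in 64-bit mode


   def segment_to_flat(aperture_base, segment_address):
       """Combine a 2^32-aligned aperture base with a 32-bit segment address."""
       if aperture_base % APERTURE_SIZE:
           raise ValueError("aperture base must be aligned to 2^32")
       if not 0 <= segment_address < APERTURE_SIZE:
           raise ValueError("segment address must fit in 32 bits")
       return aperture_base | segment_address


   def is_in_aperture(aperture_base, flat_address):
       """True when the high 32 bits of the flat address match the aperture."""
       return (flat_address >> 32) == (aperture_base >> 32)


   def flat_to_segment(aperture_base, flat_address):
       """Drop the aperture base; only valid for addresses in the aperture."""
       if not is_in_aperture(aperture_base, flat_address):
           raise ValueError("flat address is outside the aperture")
       return flat_address & (APERTURE_SIZE - 1)

Only the high 32 bits of the base take part in the aperture check, which
matches the way ``hidden_private_base`` and ``hidden_shared_base`` carry just
the high 32 bits of the aperture base.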
3864
3865 Image and Samplers
3866 ~~~~~~~~~~~~~~~~~~
3867
3868 Image and sampler handles created by an HSA compatible runtime (see
3869 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3870 object respectively. In order to support the HSA ``query_sampler`` operations
3871 two extra dwords are used to store the HSA BRIG enumeration values for the
3872 queries that are not trivially deducible from the S# representation.
3873
3874 HSA Signals
3875 ~~~~~~~~~~~
3876
3877 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3878 are 64-bit addresses of a structure allocated in memory accessible from both the
3879 CPU and GPU. The structure is defined by the runtime and subject to change
3880 between releases. For example, see [AMD-ROCm-github]_.
3881
3882 .. _amdgpu-amdhsa-hsa-aql-queue:
3883
3884 HSA AQL Queue
3885 ~~~~~~~~~~~~~
3886
3887 The HSA AQL queue structure is defined by an HSA compatible runtime (see
3888 :ref:`amdgpu-os`) and subject to change between releases. For example, see
3889 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
3890 certain language features such as the flat address aperture bases. It also
3891 contains fields used by CP such as managing the allocation of scratch memory.
3892
3893 .. _amdgpu-amdhsa-kernel-descriptor:
3894
3895 Kernel Descriptor
3896 ~~~~~~~~~~~~~~~~~
3897
3898 A kernel descriptor consists of the information needed by CP to initiate the
3899 execution of a kernel, including the entry point address of the machine code
3900 that implements the kernel.
3901
3902 Code Object V3 Kernel Descriptor
3903 ++++++++++++++++++++++++++++++++
3904
3905 CP microcode requires the kernel descriptor to be allocated on 64-byte
3906 alignment.
3907
3908 The fields used by CP for code objects before V3 also match those specified in
3909 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3910
3911   .. table:: Code Object V3 Kernel Descriptor
3912      :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3913
3914      ======= ======= =============================== ============================
3915      Bits    Size    Field Name                      Description
3916      ======= ======= =============================== ============================
3917      31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3918                                                      address space memory
3919                                                      required for a work-group
3920                                                      in bytes. This does not
3921                                                      include any dynamically
3922                                                      allocated local address
3923                                                      space memory that may be
3924                                                      added when the kernel is
3925                                                      dispatched.
3926      63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3927                                                      private address space
3928                                                      memory required for a
3929                                                      work-item in bytes.  When
3930                                                      this cannot be predicted,
3931                                                      code object v4 and older
3932                                                      sets this value to be
3933                                                      higher than the minimum
3934                                                      requirement.
3935      95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3936                                                      memory pointed to by the
3937                                                      AQL dispatch packet. The
3938                                                      kernarg memory is used to
3939                                                      pass arguments to the
3940                                                      kernel.
3941
3942                                                      * If the kernarg pointer in
3943                                                        the dispatch packet is NULL
3944                                                        then there are no kernel
3945                                                        arguments.
3946                                                      * If the kernarg pointer in
3947                                                        the dispatch packet is
3948                                                        not NULL and this value
3949                                                        is 0 then the kernarg
3950                                                        memory size is
3951                                                        unspecified.
3952                                                      * If the kernarg pointer in
3953                                                        the dispatch packet is
3954                                                        not NULL and this value
3955                                                        is not 0 then the value
3956                                                        specifies the kernarg
3957                                                        memory size in bytes. It
3958                                                        is recommended to provide
3959                                                        a value as it may be used
3960                                                        by CP to optimize making
3961                                                        the kernarg memory
3962                                                        visible to the kernel
3963                                                        code.
3964
3965      127:96  4 bytes                                 Reserved, must be 0.
3966      191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3967                                                      negative) from base
3968                                                      address of kernel
3969                                                      descriptor to kernel's
3970                                                      entry point instruction
3971                                                      which must be 256 byte
3972                                                      aligned.
3973      351:192 20                                      Reserved, must be 0.
3974              bytes
3975      383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
3976                                                        Reserved, must be 0.
3977                                                      GFX90A, GFX940
3978                                                        Compute Shader (CS)
3979                                                        program settings used by
3980                                                        CP to set up
3981                                                        ``COMPUTE_PGM_RSRC3``
3982                                                        configuration
3983                                                        register. See
3984                                                        :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3985                                                      GFX10-GFX11
3986                                                        Compute Shader (CS)
3987                                                        program settings used by
3988                                                        CP to set up
3989                                                        ``COMPUTE_PGM_RSRC3``
3990                                                        configuration
3991                                                        register. See
3992                                                        :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
3993      415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3994                                                      program settings used by
3995                                                      CP to set up
3996                                                      ``COMPUTE_PGM_RSRC1``
3997                                                      configuration
3998                                                      register. See
3999                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
4000      447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
4001                                                      program settings used by
4002                                                      CP to set up
4003                                                      ``COMPUTE_PGM_RSRC2``
4004                                                      configuration
4005                                                      register. See
4006                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
4007      454:448 7 bits  *See separate bits below.*      Enable the setup of the
4008                                                      SGPR user data registers
4009                                                      (see
4010                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4011
4012                                                      The total number of SGPR
4013                                                      user data registers
4014                                                      requested must not exceed
4015                                                      16 and must match the value in
4016                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
4017                                                      Any requests beyond 16
4018                                                      will be ignored.
4019      >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
4020                      _BUFFER                         column of
4021                                                      :ref:`amdgpu-processor-table`
4022                                                      specifies *Architected flat
4023                                                      scratch* then not supported
4024                                                      and must be 0.
4025      >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
4026      >450    1 bit   ENABLE_SGPR_QUEUE_PTR
4027      >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
4028      >452    1 bit   ENABLE_SGPR_DISPATCH_ID
4029      >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
4030                                                      column of
4031                                                      :ref:`amdgpu-processor-table`
4032                                                      specifies *Architected flat
4033                                                      scratch* then not supported
4034                                                      and must be 0.
4035      >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
4036                      _SIZE
4037      457:455 3 bits                                  Reserved, must be 0.
4038      458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
4039                                                        Reserved, must be 0.
4040                                                      GFX10-GFX11
4041                                                        - If 0 execute in
4042                                                          wavefront size 64 mode.
4043                                                        - If 1 execute in
4044                                                          native wavefront size
4045                                                          32 mode.
4046      459     1 bit   USES_DYNAMIC_STACK              Indicates if the generated
4047                                                      machine code is using a
4048                                                      dynamically sized stack.
4049                                                      This is only set in code
4050                                                      object v5 and later.
4051      463:460 4 bits                                  Reserved, must be 0.
4052      464     1 bit   RESERVED_464                    Deprecated, must be 0.
4053      467:465 3 bits                                  Reserved, must be 0.
4054      468     1 bit   RESERVED_468                    Deprecated, must be 0.
4055      471:469 3 bits                                  Reserved, must be 0.
4056      511:472 5 bytes                                 Reserved, must be 0.
4057      512     **Total size 64 bytes.**
4058      ======= ====================================================================
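
To make the layout above concrete, the following Python sketch decodes a
64-byte kernel descriptor with the standard ``struct`` module. It assumes a
little-endian encoding and that the reserved field between the entry offset and
``COMPUTE_PGM_RSRC3`` occupies 20 bytes, as its size column states. The names
``KERNEL_DESCRIPTOR``, ``PROPERTY_BITS`` and ``decode_kernel_descriptor`` are
hypothetical and are not part of LLVM or any runtime.

.. code-block:: python

   import struct

   # 64-byte kernel descriptor, little-endian; "x" entries are reserved bytes.
   KERNEL_DESCRIPTOR = struct.Struct("<3I4xq20x3IH6x")

   # Bit positions within the 16-bit field starting at descriptor bit 448.
   PROPERTY_BITS = {
       "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,
       "ENABLE_SGPR_DISPATCH_PTR": 1,
       "ENABLE_SGPR_QUEUE_PTR": 2,
       "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
       "ENABLE_SGPR_DISPATCH_ID": 4,
       "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
       "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,
       "ENABLE_WAVEFRONT_SIZE32": 10,
       "USES_DYNAMIC_STACK": 11,
   }


   def decode_kernel_descriptor(raw):
       """Split a 64-byte descriptor into named fields."""
       if len(raw) != KERNEL_DESCRIPTOR.size:
           raise ValueError("kernel descriptor must be 64 bytes")
       (group_size, private_size, kernarg_size, entry_offset,
        rsrc3, rsrc1, rsrc2, properties) = KERNEL_DESCRIPTOR.unpack(raw)
       fields = {
           "GROUP_SEGMENT_FIXED_SIZE": group_size,
           "PRIVATE_SEGMENT_FIXED_SIZE": private_size,
           "KERNARG_SIZE": kernarg_size,
           # Signed offset from the descriptor base to the entry point.
           "KERNEL_CODE_ENTRY_BYTE_OFFSET": entry_offset,
           "COMPUTE_PGM_RSRC3": rsrc3,
           "COMPUTE_PGM_RSRC1": rsrc1,
           "COMPUTE_PGM_RSRC2": rsrc2,
       }
       for name, bit in PROPERTY_BITS.items():
           fields[name] = (properties >> bit) & 1
       return fields

The entry point address is then the descriptor's own address plus
``KERNEL_CODE_ENTRY_BYTE_OFFSET``, which may be negative.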
4059
4060 ..
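
The granulated register counts in the ``compute_pgm_rsrc1`` table that follows
are rounding formulas over the number of registers used. The Python sketch
below evaluates them as written in that table; the function names are
hypothetical, and the assembler remains the authoritative source for the
encoded values.

.. code-block:: python

   import math


   def granulated_vgpr_count(vgprs_used, granule):
       """max(0, ceil(vgprs_used / granule) - 1) for a block size of granule."""
       return max(0, math.ceil(vgprs_used / granule) - 1)


   def granulated_sgpr_count_gfx6_gfx8(sgprs_used):
       """SGPR blocks of 8 registers."""
       return max(0, math.ceil(sgprs_used / 8) - 1)


   def granulated_sgpr_count_gfx9(sgprs_used):
       """SGPR blocks of 16 registers, encoded in units of two."""
       return 2 * max(0, math.ceil(sgprs_used / 16) - 1)

For instance, 24 VGPRs with a granule of 4 encode as 5, and 102 SGPRs encode as
12 under both SGPR formulas.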
4061
4062   .. table:: compute_pgm_rsrc1 for GFX6-GFX11
4063      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table
4064
4065      ======= ======= =============================== ===========================================================================
4066      Bits    Size    Field Name                      Description
4067      ======= ======= =============================== ===========================================================================
4068      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
4069                                                      blocks used by each work-item;
4070                                                      granularity is device
4071                                                      specific:
4072
4073                                                      GFX6-GFX9
4074                                                        - vgprs_used 0..256
4075                                                        - max(0, ceil(vgprs_used / 4) - 1)
4076                                                      GFX90A, GFX940
4077                                                        - vgprs_used 0..512
4078                                                        - vgprs_used = align(arch_vgprs, 4)
4079                                                                       + acc_vgprs
4080                                                        - max(0, ceil(vgprs_used / 8) - 1)
4081                                                      GFX10-GFX11 (wavefront size 64)
4082                                                        - max_vgpr 1..256
4083                                                        - max(0, ceil(vgprs_used / 4) - 1)
4084                                                      GFX10-GFX11 (wavefront size 32)
4085                                                        - max_vgpr 1..256
4086                                                        - max(0, ceil(vgprs_used / 8) - 1)
4087
4088                                                      Where vgprs_used is defined
4089                                                      as the highest VGPR number
4090                                                      explicitly referenced plus
4091                                                      one.
4092
4093                                                      Used by CP to set up
4094                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
4095
4096                                                      The
4097                                                      :ref:`amdgpu-assembler`
4098                                                      calculates this
4099                                                      automatically for the
4100                                                      selected processor from
4101                                                      values provided to the
4102                                                      `.amdhsa_kernel` directive
4103                                                      by the
4104                                                      `.amdhsa_next_free_vgpr`
4105                                                      nested directive (see
4106                                                      :ref:`amdhsa-kernel-directives-table`).
4107      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4108                                                      blocks used by a wavefront;
4109                                                      granularity is device
4110                                                      specific:
4111
4112                                                      GFX6-GFX8
4113                                                        - sgprs_used 0..112
4114                                                        - max(0, ceil(sgprs_used / 8) - 1)
4115                                                      GFX9
4116                                                        - sgprs_used 0..112
4117                                                        - 2 * max(0, ceil(sgprs_used / 16) - 1)
4118                                                      GFX10-GFX11
4119                                                        Reserved, must be 0.
4120                                                        (128 SGPRs always
4121                                                        allocated.)
4122
4123                                                      Where sgprs_used is
4124                                                      defined as the highest
4125                                                      SGPR number explicitly
4126                                                      referenced plus one, plus
4127                                                      a target specific number
4128                                                      of additional special
4129                                                      SGPRs for VCC,
4130                                                      FLAT_SCRATCH (GFX7+) and
4131                                                      XNACK_MASK (GFX8+), and
4132                                                      any additional
4133                                                      target specific
4134                                                      limitations. It does not
4135                                                      include the 16 SGPRs added
4136                                                      if a trap handler is
4137                                                      enabled.
4138
4139                                                      The target specific
4140                                                      limitations and special
4141                                                      SGPR layout are defined in
4142                                                      the hardware
4143                                                      documentation, which can
4144                                                      be found in the
4145                                                      :ref:`amdgpu-processors`
4146                                                      table.
4147
4148                                                      Used by CP to set up
4149                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
4150
4151                                                      The
4152                                                      :ref:`amdgpu-assembler`
4153                                                      calculates this
4154                                                      automatically for the
4155                                                      selected processor from
4156                                                      values provided to the
4157                                                      `.amdhsa_kernel` directive
4158                                                      by the
4159                                                      `.amdhsa_next_free_sgpr`
4160                                                      and `.amdhsa_reserve_*`
4161                                                      nested directives (see
4162                                                      :ref:`amdhsa-kernel-directives-table`).
4163      11:10   2 bits  PRIORITY                        Must be 0.
4164
4165                                                      Start executing wavefront
4166                                                      at the specified priority.
4167
4168                                                      CP is responsible for
4169                                                      filling in
4170                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
4171      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
4172                                                      with the specified
4173                                                      rounding mode for single
4174                                                      precision (32-bit)
4175                                                      floating point
4176                                                      operations.
4177
4178                                                      Floating point rounding
4179                                                      mode values are defined in
4180                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4181
4182                                                      Used by CP to set up
4183                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4184      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
4185                                                      with the specified
4186                                                      rounding mode for half and
4187                                                      double precision (16-bit
4188                                                      and 64-bit) floating point
4189                                                      operations.
4190
4191                                                      Floating point rounding
4192                                                      mode values are defined in
4193                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4194
4195                                                      Used by CP to set up
4196                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4197      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
4198                                                      with the specified
4199                                                      denorm mode for single
4200                                                      precision (32-bit)
4201                                                      floating point
4202                                                      operations.
4203
4204                                                      Floating point denorm mode
4205                                                      values are defined in
4206                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4207
4208                                                      Used by CP to set up
4209                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4210      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
4211                                                      with the specified
4212                                                      denorm mode for half and
4213                                                      double precision (16-bit
4214                                                      and 64-bit) floating point
4215                                                      operations.
4216
4217                                                      Floating point denorm mode
4218                                                      values are defined in
4219                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4220
4221                                                      Used by CP to set up
4222                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4223      20      1 bit   PRIV                            Must be 0.
4224
4225                                                      Start executing wavefront
4226                                                      in privilege trap handler
4227                                                      mode.
4228
4229                                                      CP is responsible for
4230                                                      filling in
4231                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
4232      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
4233                                                      with DX10 clamp mode
4234                                                      enabled. Used by the vector
4235                                                      ALU to force DX10 style
4236                                                      treatment of NaNs (when
4237                                                      set, clamp NaN to zero,
4238                                                      otherwise pass NaN
4239                                                      through).
4240
4241                                                      Used by CP to set up
4242                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4243      22      1 bit   DEBUG_MODE                      Must be 0.
4244
4245                                                      Start executing wavefront
4246                                                      in single step mode.
4247
4248                                                      CP is responsible for
4249                                                      filling in
4250                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4251      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
4252                                                      with IEEE mode
4253                                                      enabled. Floating point
4254                                                      opcodes that support
4255                                                      exception flag gathering
4256                                                      will quiet and propagate
4257                                                      signaling-NaN inputs per
4258                                                      IEEE 754-2008. Min_dx10 and
4259                                                      max_dx10 become IEEE
4260                                                      754-2008 compliant due to
4261                                                      signaling-NaN propagation
4262                                                      and quieting.
4263
4264                                                      Used by CP to set up
4265                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4266      24      1 bit   BULKY                           Must be 0.
4267
4268                                                      Only one work-group allowed
4269                                                      to execute on a compute
4270                                                      unit.
4271
4272                                                      CP is responsible for
4273                                                      filling in
4274                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
4275      25      1 bit   CDBG_USER                       Must be 0.
4276
4277                                                      Flag that can be used to
4278                                                      control debugging code.
4279
4280                                                      CP is responsible for
4281                                                      filling in
4282                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4283      26      1 bit   FP16_OVFL                       GFX6-GFX8
4284                                                        Reserved, must be 0.
4285                                                      GFX9-GFX11
4286                                                        Wavefront starts execution
4287                                                        with specified fp16 overflow
4288                                                        mode.
4289
4290                                                        - If 0, fp16 overflow generates
4291                                                          +/-INF values.
4292                                                        - If 1, fp16 overflow that is the
4293                                                          result of an +/-INF input value
4294                                                          or divide by 0 produces a +/-INF,
4295                                                          otherwise clamps computed
4296                                                          overflow to +/-MAX_FP16 as
4297                                                          appropriate.
4298
4299                                                        Used by CP to set up
4300                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4301      28:27   2 bits                                  Reserved, must be 0.
4302      29      1 bit   WGP_MODE                        GFX6-GFX9
4303                                                        Reserved, must be 0.
4304                                                      GFX10-GFX11
4305                                                        - If 0 execute work-groups in
4306                                                          CU wavefront execution mode.
4307                                                        - If 1 execute work-groups in
4308                                                          WGP wavefront execution mode.
4309
4310                                                        See :ref:`amdgpu-amdhsa-memory-model`.
4311
4312                                                        Used by CP to set up
4313                                                        ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4314      30      1 bit   MEM_ORDERED                     GFX6-GFX9
4315                                                        Reserved, must be 0.
4316                                                      GFX10-GFX11
4317                                                        Controls the behavior of the
4318                                                        s_waitcnt's vmcnt and vscnt
4319                                                        counters.
4320
4321                                                        - If 0 vmcnt reports completion
4322                                                          of load and atomic with return
4323                                                          out of order with sample
4324                                                          instructions, and the vscnt
4325                                                          reports the completion of
4326                                                          store and atomic without
4327                                                          return in order.
4328                                                        - If 1 vmcnt reports completion
4329                                                          of load, atomic with return
4330                                                          and sample instructions in
4331                                                          order, and the vscnt reports
4332                                                          the completion of store and
4333                                                          atomic without return in order.
4334
4335                                                        Used by CP to set up
4336                                                        ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4337      31      1 bit   FWD_PROGRESS                    GFX6-GFX9
4338                                                        Reserved, must be 0.
4339                                                      GFX10-GFX11
4340                                                        - If 0 execute SIMD wavefronts
4341                                                          using oldest first policy.
4342                                                        - If 1 execute SIMD wavefronts to
4343                                                          ensure wavefronts will make some
4344                                                          forward progress.
4345
4346                                                        Used by CP to set up
4347                                                        ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4348      32      **Total size 4 bytes.**
4349      ======= ===================================================================================================================
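
  As a rough illustration of the bit layout in the table above, the following
  Python sketch decodes the fields in bits 31:10 of a 32-bit
  ``compute_pgm_rsrc1`` value. The helper name ``decode_rsrc1`` and the
  ``RSRC1_FIELDS`` dictionary are hypothetical and not part of LLVM; bits 9:0
  and the reserved bits 28:27 are deliberately not decoded.

  .. code-block:: python

    # (low bit, width) for each field, transcribed from the table above.
    RSRC1_FIELDS = {
        "PRIORITY": (10, 2),
        "FLOAT_ROUND_MODE_32": (12, 2),
        "FLOAT_ROUND_MODE_16_64": (14, 2),
        "FLOAT_DENORM_MODE_32": (16, 2),
        "FLOAT_DENORM_MODE_16_64": (18, 2),
        "PRIV": (20, 1),
        "ENABLE_DX10_CLAMP": (21, 1),
        "DEBUG_MODE": (22, 1),
        "ENABLE_IEEE_MODE": (23, 1),
        "BULKY": (24, 1),
        "CDBG_USER": (25, 1),
        "FP16_OVFL": (26, 1),
        "WGP_MODE": (29, 1),
        "MEM_ORDERED": (30, 1),
        "FWD_PROGRESS": (31, 1),
    }


    def decode_rsrc1(word):
        """Return a dict mapping field name to value (hypothetical helper)."""
        if not 0 <= word <= 0xFFFFFFFF:
            raise ValueError("compute_pgm_rsrc1 is a 32-bit value")
        return {
            name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in RSRC1_FIELDS.items()
        }


    fields = decode_rsrc1(0x00AC0000)
    print(fields["FLOAT_DENORM_MODE_16_64"])  # 3
    print(fields["ENABLE_DX10_CLAMP"], fields["ENABLE_IEEE_MODE"])  # 1 1

  A table-driven decoder keeps the bit positions in one place, which makes it
  easy to compare against the table when the layout changes between targets.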
4350
4351 ..
4352
4353   .. table:: compute_pgm_rsrc2 for GFX6-GFX11
4354      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table
4355
4356      ======= ======= =============================== ===========================================================================
4357      Bits    Size    Field Name                      Description
4358      ======= ======= =============================== ===========================================================================
4359      0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4360                                                        private segment.
4361                                                      * If the *Target Properties*
4362                                                        column of
4363                                                        :ref:`amdgpu-processor-table`
4364                                                        does not specify
4365                                                        *Architected flat
4366                                                        scratch* then enable the
4367                                                        setup of the SGPR
4368                                                        wavefront scratch offset
4369                                                        system register (see
4370                                                        :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4371                                                      * If the *Target Properties*
4372                                                        column of
4373                                                        :ref:`amdgpu-processor-table`
4374                                                        specifies *Architected
4375                                                        flat scratch* then enable
4376                                                        the setup of the
4377                                                        FLAT_SCRATCH register
4378                                                        pair (see
4379                                                        :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4380
4381                                                      Used by CP to set up
4382                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4383      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4384                                                      user data
4385                                                      registers requested. This
4386                                                      number must be greater than
4387                                                      or equal to the number of user
4388                                                      data registers enabled.
4389
4390                                                      Used by CP to set up
4391                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4392      6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4393
4394                                                      This bit represents
4395                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4396                                                      which is set by the CP if
4397                                                      the runtime has installed a
4398                                                      trap handler.
4399      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4400                                                      system SGPR register for
4401                                                      the work-group id in the X
4402                                                      dimension (see
4403                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4404
4405                                                      Used by CP to set up
4406                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4407      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4408                                                      system SGPR register for
4409                                                      the work-group id in the Y
4410                                                      dimension (see
4411                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4412
4413                                                      Used by CP to set up
4414                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4415      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4416                                                      system SGPR register for
4417                                                      the work-group id in the Z
4418                                                      dimension (see
4419                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4420
4421                                                      Used by CP to set up
4422                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4423      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4424                                                      system SGPR register for
4425                                                      work-group information (see
4426                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4427
4428                                                      Used by CP to set up
4429                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4430      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4431                                                      VGPR system registers used
4432                                                      for the work-item ID.
4433                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4434                                                      defines the values.
4435
4436                                                      Used by CP to set up
4437                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4438      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4439
4440                                                      Wavefront starts execution
4441                                                      with address watch
4442                                                      exceptions enabled which
4443                                                      are generated when L1 has
4444                                                      witnessed a thread access
4445                                                      an *address of
4446                                                      interest*.
4447
4448                                                      CP is responsible for
4449                                                      filling in the address
4450                                                      watch bit in
4451                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4452                                                      according to what the
4453                                                      runtime requests.
4454      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4455
4456                                                      Wavefront starts execution
4457                                                      with memory violation
4458                                                      exceptions
4459                                                      enabled which are generated
4460                                                      when a memory violation has
4461                                                      occurred for this wavefront from
4462                                                      L1 or LDS
4463                                                      (write-to-read-only-memory,
4464                                                      mis-aligned atomic, LDS
4465                                                      address out of range,
4466                                                      illegal address, etc.).
4467
4468                                                      CP sets the memory
4469                                                      violation bit in
4470                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4471                                                      according to what the
4472                                                      runtime requests.
4473      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4474
4475                                                      CP uses the rounded value
4476                                                      from the dispatch packet,
4477                                                      not this value, as the
4478                                                      dispatch may contain
4479                                                      dynamically allocated group
4480                                                      segment memory. CP writes
4481                                                      directly to
4482                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4483
4484                                                      Amount of group segment
4485                                                      (LDS) to allocate for each
4486                                                      work-group. Granularity is
4487                                                      device specific:
4488
4489                                                      GFX6
4490                                                        roundup(lds-size / (64 * 4))
4491                                                      GFX7-GFX11
4492                                                        roundup(lds-size / (128 * 4))
4493
4494      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4495                      _INVALID_OPERATION              with specified exceptions
4496                                                      enabled.
4497
4498                                                      Used by CP to set up
4499                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
4500                                                      (set from bits 0..6).
4501
4502                                                      IEEE 754 FP Invalid
4503                                                      Operation
4504      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
4505                      _SOURCE                         input operands is a
4506                                                      denormal number
4507      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4508                      _DIVISION_BY_ZERO               Zero
4509      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4510                      _OVERFLOW
4511      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4512                      _UNDERFLOW
4513      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4514                      _INEXACT
4515      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4516                      _ZERO                           (rcp_iflag_f32 instruction
4517                                                      only)
4518      31      1 bit                                   Reserved, must be 0.
4519      32      **Total size 4 bytes.**
4520      ======= ===================================================================================================================
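
  The following Python sketch illustrates two parts of the table above: the
  ``GRANULATED_LDS_SIZE`` granularity formula and the packing of a few
  ``compute_pgm_rsrc2`` fields. The function names and parameters are
  hypothetical. The sketch assumes that ``lds-size`` is in bytes and that
  ``roundup`` rounds up to a whole granule. Note that the kernel descriptor
  itself must store 0 in ``GRANULATED_LDS_SIZE``; the first function only
  shows how a value of that granularity would be derived.

  .. code-block:: python

    def granulated_lds_size(lds_size_bytes, gfx_major):
        """LDS size in granules: 64 dwords on GFX6, 128 dwords on GFX7-GFX11.

        Assumes lds_size_bytes is a byte count (an assumption of this sketch).
        """
        if not 6 <= gfx_major <= 11:
            raise ValueError("the table covers GFX6-GFX11 only")
        granule_bytes = (64 if gfx_major == 6 else 128) * 4
        return -(-lds_size_bytes // granule_bytes)  # ceiling division


    def pack_rsrc2(enable_private_segment, user_sgpr_count,
                   workgroup_id_xyz, vgpr_workitem_id):
        """Pack bits 0, 5:1, 9:7 and 12:11; all other bits stay 0."""
        if not 0 <= user_sgpr_count < 32:
            raise ValueError("USER_SGPR_COUNT is a 5-bit field")
        if not 0 <= vgpr_workitem_id < 4:
            raise ValueError("ENABLE_VGPR_WORKITEM_ID is a 2-bit field")
        x, y, z = workgroup_id_xyz
        word = int(bool(enable_private_segment))  # bit 0
        word |= user_sgpr_count << 1              # bits 5:1
        word |= int(bool(x)) << 7                 # ENABLE_SGPR_WORKGROUP_ID_X
        word |= int(bool(y)) << 8                 # ENABLE_SGPR_WORKGROUP_ID_Y
        word |= int(bool(z)) << 9                 # ENABLE_SGPR_WORKGROUP_ID_Z
        word |= vgpr_workitem_id << 11            # bits 12:11
        return word


    print(granulated_lds_size(1024, 9))                     # 2
    print(hex(pack_rsrc2(True, 6, (True, False, False), 2)))  # 0x108d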
4521
4522 ..
4523
4524   .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4525      :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4526
4527      ======= ======= =============================== ===========================================================================
4528      Bits    Size    Field Name                      Description
4529      ======= ======= =============================== ===========================================================================
4530      5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
4531                                                      Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4532                                                      63 - accum-offset = 256.
4533      15:6    10                                      Reserved, must be 0.
4534              bits
4535      16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
4536                                                        launched in the same CU.
4537                                                      - If 1 the waves of a work-group can be
4538                                                        launched in different CUs. The waves
4539                                                        cannot use S_BARRIER or LDS.
4540      31:17   15                                      Reserved, must be 0.
4541              bits
4542      32      **Total size 4 bytes.**
4543      ======= ===================================================================================================================
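
  The ``ACCUM_OFFSET`` encoding above maps a field value ``n`` to an offset of
  ``(n + 1) * 4`` registers. A minimal Python sketch of that mapping follows;
  the function names are hypothetical and the reserved bits are left as 0.

  .. code-block:: python

    def encode_accum_offset(accum_offset):
        """Convert an offset of 4..256 (multiple of 4) to the 6-bit field."""
        if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
            raise ValueError("accum-offset must be a multiple of 4 in 4..256")
        return accum_offset // 4 - 1


    def decode_accum_offset(field):
        """Convert the 6-bit field value back to the register offset."""
        if not 0 <= field <= 63:
            raise ValueError("ACCUM_OFFSET is a 6-bit field")
        return (field + 1) * 4


    def pack_rsrc3_gfx90a(accum_offset, tg_split):
        """Pack ACCUM_OFFSET (bits 5:0) and TG_SPLIT (bit 16)."""
        return encode_accum_offset(accum_offset) | (int(bool(tg_split)) << 16)


    print(encode_accum_offset(8))               # 1
    print(hex(pack_rsrc3_gfx90a(256, True)))    # 0x1003f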
4544
4545 ..
4546
4547   .. table:: compute_pgm_rsrc3 for GFX10-GFX11
4548      :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
4549
4550      ======= ======= =============================== ===========================================================================
4551      Bits    Size    Field Name                      Description
4552      ======= ======= =============================== ===========================================================================
4553      3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPR blocks when executing in subvector mode. For
4554                                                      wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4555                                                      of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4556                                                      not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4557      9:4     6 bits  INST_PREF_SIZE                  GFX10
4558                                                        Reserved, must be 0.
4559                                                      GFX11
4560                                                        Number of instruction bytes to prefetch, starting at the kernel's entry
4561                                                        point instruction, before wavefront starts execution. The value is 0..63
4562                                                        with a granularity of 128 bytes.
4563      10      1 bit   TRAP_ON_START                   GFX10
4564                                                        Reserved, must be 0.
4565                                                      GFX11
4566                                                        Must be 0.
4567
4568                                                        If 1, wavefront starts execution by trapping into the trap handler.
4569
4570                                                        CP is responsible for filling in the trap on start bit in
4571                                                        ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
4572                                                        requests.
4573      11      1 bit   TRAP_ON_END                     GFX10
4574                                                        Reserved, must be 0.
4575                                                      GFX11
4576                                                        Must be 0.
4577
4578                                                        If 1, wavefront execution terminates by trapping into the trap handler.
4579
4580                                                        CP is responsible for filling in the trap on end bit in
4581                                                        ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
4582      30:12   19 bits                                 Reserved, must be 0.
4583      31      1 bit   IMAGE_OP                        GFX10
4584                                                        Reserved, must be 0.
4585                                                      GFX11
4586                                                        If 1, the kernel execution contains image instructions. If executed as
4587                                                        part of a graphics pipeline, image read instructions will stall waiting
4588                                                        for any necessary ``WAIT_SYNC`` fence to be performed in order to
4589                                                        indicate that earlier pipeline stages have completed writing to the
4590                                                        image.
4591
4592                                                        Not used for compute kernels that are not part of a graphics pipeline,
4593                                                        for which it must be 0.
4594      32      **Total size 4 bytes.**
4595      ======= ===================================================================================================================
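
  The ``SHARED_VGPR_COUNT`` constraint and the ``INST_PREF_SIZE`` granularity
  described above can be checked with a short Python sketch. The function
  names are hypothetical, and ``granulated_vgprs`` stands for the
  ``compute_pgm_rsrc1.vgprs`` value used in the table's formula.

  .. code-block:: python

    def shared_vgpr_count_is_valid(granulated_vgprs, shared_vgpr_count,
                                   wavefront_size):
        """Check SHARED_VGPR_COUNT against the limits stated in the table."""
        if wavefront_size == 32:
            return shared_vgpr_count == 0
        if wavefront_size != 64:
            raise ValueError("wavefront size must be 32 or 64")
        if not 0 <= shared_vgpr_count <= 15:
            return False
        total = (granulated_vgprs + 1) * 4 + shared_vgpr_count * 8
        return total <= 256


    def inst_pref_bytes(inst_pref_size):
        """Bytes prefetched for an INST_PREF_SIZE value (128-byte units)."""
        if not 0 <= inst_pref_size <= 63:
            raise ValueError("INST_PREF_SIZE is a 6-bit field")
        return inst_pref_size * 128


    print(shared_vgpr_count_is_valid(31, 15, 64))  # True: 128 + 120 <= 256
    print(shared_vgpr_count_is_valid(35, 15, 64))  # False: 144 + 120 > 256
    print(inst_pref_bytes(63))                     # 8064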
4596
4597 ..
4598
4599   .. table:: Floating Point Rounding Mode Enumeration Values
4600      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4601
4602      ====================================== ===== ==============================
4603      Enumeration Name                       Value Description
4604      ====================================== ===== ==============================
4605      FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4606      FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4607      FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4608      FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4609      ====================================== ===== ==============================
4610
4611 ..
4612
4613   .. table:: Floating Point Denorm Mode Enumeration Values
4614      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4615
4616      ====================================== ===== ==============================
4617      Enumeration Name                       Value Description
4618      ====================================== ===== ==============================
4619      FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
4620                                                   Denorms
4621      FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4622      FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4623      FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4624      ====================================== ===== ==============================
4625
4626 ..
4627
4628   .. table:: System VGPR Work-Item ID Enumeration Values
4629      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4630
4631      ======================================== ===== ============================
4632      Enumeration Name                         Value Description
4633      ======================================== ===== ============================
4634      SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4635                                                     ID.
4636      SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4637                                                     dimensions ID.
4638      SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4639                                                     dimensions ID.
4640      SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4641      ======================================== ===== ============================
4642
4643 .. _amdgpu-amdhsa-initial-kernel-execution-state:
4644
4645 Initial Kernel Execution State
4646 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4647
4648 This section defines the register state that will be set up by the packet
4649 processor prior to the start of execution of every wavefront. This is limited by
4650 the constraints of the hardware controllers of CP/ADC/SPI.
4651
4652 The order of the SGPR registers is defined, but the compiler can specify which
4653 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4654 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4655 for enabled registers are dense starting at SGPR0: the first enabled register is
4656 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4657 an SGPR number.
4658
4659 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4660 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4661 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4662 actually initialized. These are then immediately followed by the System SGPRs
4663 that are set up by ADC/SPI and can have different values for each wavefront of
4664 the grid dispatch.
4665
4666 SGPR register initial state is defined in
4667 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4668
4669   .. table:: SGPR Register Set Up Order
4670      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4671
4672      ========== ========================== ====== ==============================
4673      SGPR Order Name                       Number Description
4674                 (kernel descriptor enable  of
4675                 field)                     SGPRs
4676      ========== ========================== ====== ==============================
4677      First      Private Segment Buffer     4      See
4678                 (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4679                 _segment_buffer)
4680      then       Dispatch Ptr               2      64-bit address of AQL dispatch
4681                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4682                                                   actually executing.
4683      then       Queue Ptr                  2      64-bit address of amd_queue_t
4684                 (enable_sgpr_queue_ptr)           object for AQL queue on which
4685                                                   the dispatch packet was
4686                                                   queued.
4687      then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4688                 (enable_sgpr_kernarg              segment. This is directly
4689                 _segment_ptr)                     copied from the
4690                                                   kernarg_address in the kernel
4691                                                   dispatch packet.
4692
4693                                                   Having CP load it once avoids
4694                                                   loading it at the beginning of
4695                                                   every wavefront.
4696      then       Dispatch Id                2      64-bit Dispatch ID of the
4697                 (enable_sgpr_dispatch_id)         dispatch packet being
4698                                                   executed.
4699      then       Flat Scratch Init          2      See
4700                 (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4701                 _init)
4702      then       Private Segment Size       1      The 32-bit byte size of a
4703                 (enable_sgpr_private              single work-item's memory
4704                 _segment_size)                    allocation. This is the
4705                                                   value from the kernel
4706                                                   dispatch packet Private
4707                                                   Segment Byte Size rounded up
4708                                                   by CP to a multiple of
4709                                                   DWORD.
4710
4711                                                   Having CP load it once avoids
4712                                                   loading it at the beginning of
4713                                                   every wavefront.
4714
4715                                                   This is not used for
4716                                                   GFX7-GFX8 since it is the same
4717                                                   value as the second SGPR of
4718                                                   Flat Scratch Init. However, it
4719                                                   may be needed for GFX9-GFX11 which
4720                                                   changes the meaning of the
4721                                                   Flat Scratch Init value.
4722      then       Work-Group Id X            1      32-bit work-group id in X
4723                 (enable_sgpr_workgroup_id         dimension of grid for
4724                 _X)                               wavefront.
4725      then       Work-Group Id Y            1      32-bit work-group id in Y
4726                 (enable_sgpr_workgroup_id         dimension of grid for
4727                 _Y)                               wavefront.
4728      then       Work-Group Id Z            1      32-bit work-group id in Z
4729                 (enable_sgpr_workgroup_id         dimension of grid for
4730                 _Z)                               wavefront.
4731      then       Work-Group Info            1      {first_wavefront, 14'b0000,
4732                 (enable_sgpr_workgroup            ordered_append_term[10:0],
4733                 _info)                            threadgroup_size_in_wavefronts[5:0]}
4734      then       Scratch Wavefront Offset   1      See
4735                 (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`
4736                 _segment_wavefront_offset)        and
4737                                                   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4738      ========== ========================== ====== ==============================
4739
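The dense numbering can be illustrated with a short Python sketch. It is not
part of any runtime or compiler API: ``SGPR_ORDER`` and ``assign_sgprs`` are
hypothetical names, and the field names are the kernel descriptor enable fields
from the table with the ``enable_sgpr_`` prefix dropped. The User SGPR fields
listed in the table total 15 registers, so the sketch does not model the 16
User SGPR limit.

.. code-block:: python

  # Order and SGPR count of each field, copied from the SGPR set up order
  # table. The names are illustrative, not an existing API.
  SGPR_ORDER = [
      ("private_segment_buffer", 4),
      ("dispatch_ptr", 2),
      ("queue_ptr", 2),
      ("kernarg_segment_ptr", 2),
      ("dispatch_id", 2),
      ("flat_scratch_init", 2),
      ("private_segment_size", 1),
      ("workgroup_id_x", 1),
      ("workgroup_id_y", 1),
      ("workgroup_id_z", 1),
      ("workgroup_info", 1),
      ("private_segment_wavefront_offset", 1),
  ]

  def assign_sgprs(enabled):
      """Map each enabled field to its (first, last) SGPR numbers."""
      layout = {}
      next_sgpr = 0
      for name, count in SGPR_ORDER:
          if name in enabled:
              # Enabled registers are packed densely starting at SGPR0;
              # disabled registers do not consume an SGPR number.
              layout[name] = (next_sgpr, next_sgpr + count - 1)
              next_sgpr += count
      return layout

Under these assumptions, enabling only the Private Segment Buffer, the Kernarg
Segment Ptr and Work-Group Id X places them in SGPR0-3, SGPR4-5 and SGPR6
respectively.
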
4740 The order of the VGPR registers is defined, but the compiler can specify which
4741 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4742 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4743 for enabled registers are dense starting at VGPR0: the first enabled register is
4744 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4745 VGPR number.
4746
4747 There are different methods used for the VGPR initial state:
4748
4749 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4750   specifies otherwise, a separate VGPR register is used per work-item ID. The
4751   VGPR register initial state for this method is defined in
4752   :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4753 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4754   specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4755   for all work-item IDs. The register layout for this method is defined in
4756   :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4757
4758   .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4759      :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4760
4761      ========== ========================== ====== ==============================
4762      VGPR Order Name                       Number Description
4763                 (kernel descriptor enable  of
4764                 field)                     VGPRs
4765      ========== ========================== ====== ==============================
4766      First      Work-Item Id X             1      32-bit work-item id in X
4767                 (Always initialized)              dimension of work-group for
4768                                                   wavefront lane.
4769      then       Work-Item Id Y             1      32-bit work-item id in Y
4770                 (enable_vgpr_workitem_id          dimension of work-group for
4771                 > 0)                              wavefront lane.
4772      then       Work-Item Id Z             1      32-bit work-item id in Z
4773                 (enable_vgpr_workitem_id          dimension of work-group for
4774                 > 1)                              wavefront lane.
4775      ========== ========================== ====== ==============================
4776
4777 ..
4778
4779   .. table:: Register Layout for Packed Work-Item ID Method
4780      :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4781
4782      ======= ======= ================ =========================================
4783      Bits    Size    Field Name       Description
4784      ======= ======= ================ =========================================
4785      0:9     10 bits Work-Item Id X   Work-item id in X
4786                                       dimension of work-group for
4787                                       wavefront lane.
4788
4789                                       Always initialized.
4790
4791      10:19   10 bits Work-Item Id Y   Work-item id in Y
4792                                       dimension of work-group for
4793                                       wavefront lane.
4794
4795                                       Initialized if enable_vgpr_workitem_id >
4796                                       0, otherwise set to 0.
4797      20:29   10 bits Work-Item Id Z   Work-item id in Z
4798                                       dimension of work-group for
4799                                       wavefront lane.
4800
4801                                       Initialized if enable_vgpr_workitem_id >
4802                                       1, otherwise set to 0.
4803      30:31   2 bits                   Reserved, set to 0.
4804      ======= ======= ================ =========================================
4805
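As an illustration of the two VGPR methods, the following Python sketch lists
which work-item ID registers the unpacked method initializes and decodes the
packed VGPR0 layout. The function names are hypothetical, and the sketch only
restates the two tables above.

.. code-block:: python

  ID_MASK = 0x3FF  # each packed work-item ID field is 10 bits wide

  def unpacked_work_item_id_vgprs(enable_vgpr_workitem_id):
      """Return the VGPR number of each initialized ID (unpacked method)."""
      if enable_vgpr_workitem_id not in (0, 1, 2):
          raise ValueError("enumeration value 3 is undefined")
      names = ["x", "y", "z"]
      # X is always initialized; Y needs a value > 0 and Z a value > 1.
      enabled = names[:enable_vgpr_workitem_id + 1]
      return {name: vgpr for vgpr, name in enumerate(enabled)}

  def unpack_work_item_ids(vgpr0):
      """Split a packed VGPR0 value into (x, y, z) work-item IDs."""
      x = vgpr0 & ID_MASK           # bits 0:9
      y = (vgpr0 >> 10) & ID_MASK   # bits 10:19
      z = (vgpr0 >> 20) & ID_MASK   # bits 20:29
      return x, y, z
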
4806 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4807
4808 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4809    registers.
4810 2. Work-group Id registers X, Y, Z are set by ADC which supports any
4811    combination including none.
4812 3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
4813    its value cannot be included with the flat scratch init value which is per
4814    queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4815 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4816    or (X, Y, Z).
4817 5. Flat Scratch register pair initialization is described in
4818    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4819
4820 The global segment can be accessed either using buffer instructions (GFX6 which
4821 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
4822 instructions (GFX9-GFX11).
4823
4824 If buffer operations are used, then the compiler can generate a V# with the
4825 following properties:
4826
4827 * base address of 0
4828 * no swizzle
4829 * ATC: 1 if IOMMU present (such as APU)
4830 * ptr64: 1
4831 * MTYPE set to support memory coherence that matches the runtime (such as CC for
4832   APU and NC for dGPU).
4833
4834 .. _amdgpu-amdhsa-kernel-prolog:
4835
4836 Kernel Prolog
4837 ~~~~~~~~~~~~~
4838
4839 The compiler performs initialization in the kernel prolog depending on the
4840 target and on information such as stack usage in the kernel and called
4841 functions. Some of this initialization requires the compiler to request that
4842 certain User and System SGPRs be present in the
4843 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4844 :ref:`amdgpu-amdhsa-kernel-descriptor`.
4845
4846 .. _amdgpu-amdhsa-kernel-prolog-cfi:
4847
4848 CFI
4849 +++
4850
4851 1.  The CFI return address is undefined.
4852
4853 2.  The CFI CFA is defined using an expression which evaluates to a location
4854     description that comprises one memory location description for the
4855     ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4856
4857 .. _amdgpu-amdhsa-kernel-prolog-m0:
4858
4859 M0
4860 ++
4861
4862 GFX6-GFX8
4863   The M0 register must be initialized with a value at least as large as the
4864   total LDS size if the kernel may access LDS via DS or flat operations. The
4865   total LDS size is available in the dispatch packet. Alternatively, M0 can be
4866   set to the maximum possible value of LDS for the given target (0x7FFF for
4867   GFX6 and 0xFFFF for GFX7-GFX8).
4868 GFX9-GFX11
4869   The M0 register is not used for range checking LDS accesses and so does not
4870   need to be initialized in the prolog.
4871
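A minimal Python sketch of the conservative choice described above, assuming
the target is identified only by its GFX generation number. The function name
is hypothetical.

.. code-block:: python

  def max_m0_for_lds(gfx_generation):
      """Return the maximum M0 value usable for LDS, or None if M0 is unused."""
      if gfx_generation == 6:
          return 0x7FFF
      if gfx_generation in (7, 8):
          return 0xFFFF
      if 9 <= gfx_generation <= 11:
          # M0 is not used for range checking LDS accesses on these targets.
          return None
      raise ValueError("generation not covered by this section")
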
4872 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4873
4874 Stack Pointer
4875 +++++++++++++
4876
4877 If the kernel has function calls it must set up the ABI stack pointer described
4878 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4879 SGPR32 to the unswizzled scratch offset of the address past the last local
4880 allocation.
4881
4882 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4883
4884 Frame Pointer
4885 +++++++++++++
4886
4887 If the kernel needs a frame pointer for the reasons defined in
4888 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4889 kernel prolog. If a frame pointer is not required then all uses of the frame
4890 pointer are replaced with immediate ``0`` offsets.
4891
4892 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4893
4894 Flat Scratch
4895 ++++++++++++
4896
4897 There are different methods used for initializing flat scratch:
4898
4899 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4900   specifies *Does not support generic address space*:
4901
4902   Flat scratch is not supported and there is no flat scratch register pair.
4903
4904 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4905   specifies *Offset flat scratch*:
4906
4907   If the kernel or any function it calls may use flat operations to access
4908   scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4909   (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4910   Scratch Wavefront Offset SGPR registers (see
4911   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4912
4913   1. The low word of Flat Scratch Init is the 32-bit byte offset from
4914      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4915      being managed by SPI for the queue executing the kernel dispatch. This is
4916      the same value used in the Scratch Segment Buffer V# base address.
4917
4918      CP obtains this from the runtime. (The Scratch Segment Buffer base address
4919      is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4920
4921      The prolog must add the value of Scratch Wavefront Offset to get the
4922      wavefront's byte scratch backing memory offset from
4923      ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4924
4925      The Scratch Wavefront Offset must also be used as an offset with Private
4926      segment address when using the Scratch Segment Buffer.
4927
4928      Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
4929      shifted by 8 before moving into FLAT_SCRATCH_HI.
4930
4931      FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4932      SGPRn is the highest numbered SGPR allocated to the wavefront).
4933      FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4934      added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4935      FLAT SCRATCH BASE in flat memory instructions that access the scratch
4936      aperture.
4937   2. The second word of Flat Scratch Init is the 32-bit byte size of a single
4938      work-item's scratch memory usage.
4939
4940      CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4941      checks that the value in the kernel dispatch packet Private Segment Byte
4942      Size is not larger and requests the runtime to increase the queue's scratch
4943      size if necessary.
4944
4945      CP directly loads from the kernel dispatch packet Private Segment Byte Size
4946      field and rounds up to a multiple of DWORD. Having CP load it once avoids
4947      loading it at the beginning of every wavefront.
4948
4949      The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4950      GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4951      in flat memory instructions.
4952
4953 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4954   specifies *Absolute flat scratch*:
4955
4956   If the kernel or any function it calls may use flat operations to access
4957   scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4958   (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4959   uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4960   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4961
4962   The Flat Scratch Init is the 64-bit address of the base of scratch backing
4963   memory being managed by SPI for the queue executing the kernel dispatch.
4964
4965   CP obtains this from the runtime.
4966
4967   The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4968   and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4969   which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4970   memory instructions.
4971
4972   The Scratch Wavefront Offset must also be used as an offset with Private
4973   segment address when using the Scratch Segment Buffer (see
4974   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4975
4976 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4977   specifies *Architected flat scratch*:
4978
4979   If ENABLE_PRIVATE_SEGMENT is enabled in
4980   :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH
4981   register pair will be initialized to the 64-bit address of the base of scratch
4982   backing memory being managed by SPI for the queue executing the kernel
4983   dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4984   flat scratch base in flat memory instructions.
4985
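The arithmetic of the *Offset flat scratch* and *Absolute flat scratch* methods
can be sketched in Python as follows. This models only the values the prolog
computes, not the instructions it emits. The function and parameter names are
hypothetical, and the 32-bit and 64-bit wraparound is an assumption that
mirrors fixed-width register arithmetic.

.. code-block:: python

  MASK32 = 0xFFFFFFFF
  MASK64 = 0xFFFFFFFFFFFFFFFF

  def offset_flat_scratch(init_lo, init_hi, wave_offset):
      """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the offset method."""
      # Byte offset of this wavefront's scratch from the private base.
      wave_byte_offset = (init_lo + wave_offset) & MASK32
      flat_scratch_hi = wave_byte_offset >> 8  # units of 256 bytes
      flat_scratch_lo = init_hi                # per work-item size in bytes
      return flat_scratch_lo, flat_scratch_hi

  def absolute_flat_scratch(init_lo, init_hi, wave_offset):
      """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the absolute method."""
      # Flat Scratch Init is a 64-bit address; add the wavefront offset.
      base = (((init_hi << 32) | init_lo) + wave_offset) & MASK64
      return base & MASK32, base >> 32

In the second function the addition is a full 64-bit add, so a carry out of the
low word propagates into the high word.
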
4986 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4987
4988 Private Segment Buffer
4989 ++++++++++++++++++++++
4990
4991 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4992 *Architected flat scratch* then a Private Segment Buffer is not supported.
4993 Instead the flat SCRATCH instructions are used.
4994
4995 Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
4996 that are used as a V# to access scratch. CP uses the value provided by the
4997 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
4998 access the private memory space using a segment address. See
4999 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
5000
5001 The scratch V# occupies four SGPRs starting at a four-aligned SGPR index and is
5002 always selected for the kernel as follows:
5003
5004   - If it is known during instruction selection that there is stack usage,
5005     SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
5006     optimizations are disabled (``-O0``), if stack objects already exist (for
5007     locals, etc.), or if there are any function calls.
5008
5009   - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5010     are reserved for the tentative scratch V#. These will be used if it is
5011     determined that spilling is needed.
5012
5013     - If no use is made of the tentative scratch V#, then it is unreserved,
5014       and the register count is determined ignoring it.
5015     - If use is made of the tentative scratch V#, then its register numbers
5016       are shifted to the first four-aligned SGPR index after the highest one
5017       allocated by the register allocator, and all uses are updated. The
5018       register count includes them in the shifted location.
5019     - In either case, if the processor has the SGPR allocation bug, the
5020       tentative allocation is not shifted or unreserved in order to ensure
5021       the register count is higher to work around the bug.
5022
5023     .. note::
5024
5025       This approach of using a tentative scratch V# and shifting the register
5026       numbers if used avoids having to perform register allocation a second
5027       time if the tentative V# is eliminated. This is more efficient and
5028       avoids the problem that the second register allocation may perform
5029       spilling which will fail as there is no longer a scratch V#.
5030
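The shift of a used tentative scratch V# can be sketched in Python as rounding
up to the next four-aligned index. The helper name and its parameter are
hypothetical and do not correspond to functions in the backend.

.. code-block:: python

  def shifted_scratch_vsharp(highest_allocated_sgpr):
      """Return the four SGPR indices of the scratch V# after shifting."""
      # First index strictly greater than the highest allocated SGPR,
      # rounded up to a multiple of four.
      first = (highest_allocated_sgpr + 1 + 3) & ~3
      return [first, first + 1, first + 2, first + 3]

For example, if the register allocator's highest SGPR is SGPR9, the sketch
places the scratch V# in SGPR12-15.
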
5031 When the kernel prolog code is being emitted it is known whether the scratch V#
5032 described above is actually used. If it is, the prolog code must set it up by
5033 copying the Private Segment Buffer to the scratch V# registers and then adding
5034 the Private Segment Wavefront Offset to the queue base address in the V#. The
5035 result is a V# with a base address pointing to the beginning of the wavefront
5036 scratch backing memory.
5037
5038 The Private Segment Buffer is always requested, but the Private Segment
5039 Wavefront Offset is only requested if it is used (see
5040 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5041
5042 .. _amdgpu-amdhsa-memory-model:
5043
5044 Memory Model
5045 ~~~~~~~~~~~~
5046
5047 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5048 code (see :ref:`memmodel`).
5049
5050 The AMDGPU backend supports the memory synchronization scopes specified in
5051 :ref:`amdgpu-memory-scopes`.
5052
5053 The code sequences used to implement the memory model specify the order of
5054 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5055 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5056 to other memory instructions executed by the same thread. This allows them to be
5057 moved earlier or later which can allow them to be combined with other instances
5058 of the same instruction, or hoisted/sunk out of loops to improve performance.
5059 Only the instructions related to the memory model are given; additional
5060 ``s_waitcnt`` instructions are required to ensure registers are defined before
5061 being used. It may be possible to combine these with the memory model
5062 ``s_waitcnt`` instructions as described above.
5063
5064 The AMDGPU backend supports the following memory models:
5065
5066   HSA Memory Model [HSA]_
5067     The HSA memory model uses a single happens-before relation for all address
5068     spaces (see :ref:`amdgpu-address-spaces`).
5069   OpenCL Memory Model [OpenCL]_
5070     The OpenCL memory model has separate happens-before relations for the
5071     global and local address spaces. Only a fence specifying both the global and
5072     local address spaces, and seq_cst instructions, join the relationships. Since
5073     the LLVM ``fence`` instruction does not allow an address space to be
5074     specified, the OpenCL fence has to conservatively assume both the local and
5075     global address spaces were specified. However, optimizations can often be
5076     done to eliminate the additional ``s_waitcnt`` instructions when there are
5077     no intervening memory instructions which access the corresponding address
5078     space. The code sequences in the table indicate what can be omitted for the
5079     OpenCL memory model. The target triple environment is used to determine if the
5080     source language is OpenCL (see :ref:`amdgpu-opencl`).
5081
5082 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
5083 operations.
5084
5085 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5086 termed vector memory operations.
5087
5088 Private address space uses ``buffer_load/store`` using the scratch V#
5089 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5090 is accessing the memory, atomic memory orderings are not meaningful, and all
5091 accesses are treated as non-atomic.
5092
5093 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5094 scalar memory instructions). Since the constant address space contents do not
5095 change during the execution of a kernel dispatch it is not legal to perform
5096 stores, and atomic memory orderings are not meaningful, and all accesses are
5097 treated as non-atomic.
5098
5099 A memory synchronization scope wider than work-group is not meaningful for the
5100 group (LDS) address space and is treated as work-group.
5101
5102 The memory model does not support the region address space which is treated as
5103 non-atomic.
5104
5105 Acquire memory ordering is not meaningful on store atomic instructions and is
5106 treated as non-atomic.
5107
5108 Release memory ordering is not meaningful on load atomic instructions and is
5109 treated as non-atomic.
5110
5111 Acquire-release memory ordering is not meaningful on load or store atomic
5112 instructions and is treated as acquire and release respectively.
5113
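The three rules above for orderings that are not meaningful on atomic loads and
stores can be summarized by a small Python sketch. The function name and the
string spellings of the instruction kinds and orderings are illustrative
assumptions, not backend identifiers.

.. code-block:: python

  def effective_ordering(instruction, ordering):
      """Return the ordering an atomic load or store is treated as having."""
      if instruction == "load":
          # Release is not meaningful on a load; acq_rel degrades to acquire.
          remap = {"release": "non-atomic", "acq_rel": "acquire"}
      elif instruction == "store":
          # Acquire is not meaningful on a store; acq_rel degrades to release.
          remap = {"acquire": "non-atomic", "acq_rel": "release"}
      else:
          remap = {}
      return remap.get(ordering, ordering)
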
5114 The memory order also adds the single thread optimization constraints defined in
5115 table
5116 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5117
5118   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5119      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5120
5121      ============ ==============================================================
5122      LLVM Memory  Optimization Constraints
5123      Ordering
5124      ============ ==============================================================
5125      unordered    *none*
5126      monotonic    *none*
5127      acquire      - If a load atomic/atomicrmw then no following load/load
5128                     atomic/store/store atomic/atomicrmw/fence instruction can be
5129                     moved before the acquire.
5130                   - If a fence then same as load atomic, plus no preceding
5131                     associated fence-paired-atomic can be moved after the fence.
5132      release      - If a store atomic/atomicrmw then no preceding load/load
5133                     atomic/store/store atomic/atomicrmw/fence instruction can be
5134                     moved after the release.
5135                   - If a fence then same as store atomic, plus no following
5136                     associated fence-paired-atomic can be moved before the
5137                     fence.
5138      acq_rel      Same constraints as both acquire and release.
5139      seq_cst      - If a load atomic then same constraints as acquire, plus no
5140                     preceding sequentially consistent load atomic/store
5141                     atomic/atomicrmw/fence instruction can be moved after the
5142                     seq_cst.
5143                   - If a store atomic then the same constraints as release, plus
5144                     no following sequentially consistent load atomic/store
5145                     atomic/atomicrmw/fence instruction can be moved before the
5146                     seq_cst.
5147                   - If an atomicrmw/fence then same constraints as acq_rel.
5148      ============ ==============================================================
5149
5150 The code sequences used to implement the memory model are defined in the
5151 following sections:
5152
5153 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5154 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5155 * :ref:`amdgpu-amdhsa-memory-model-gfx940`
5156 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5157
5158 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5159
5160 Memory Model GFX6-GFX9
5161 ++++++++++++++++++++++
5162
5163 For GFX6-GFX9:
5164
5165 * Each agent has multiple shader arrays (SA).
5166 * Each SA has multiple compute units (CU).
5167 * Each CU has multiple SIMDs that execute wavefronts.
5168 * The wavefronts for a single work-group are executed in the same CU but may be
5169   executed by different SIMDs.
5170 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
5171   executing on it.
5172 * All LDS operations of a CU are performed as wavefront wide operations in a
5173   global order and involve no caching. Completion is reported to a wavefront in
5174   execution order.
5175 * The LDS memory has multiple request queues shared by the SIMDs of a
5176   CU. Therefore, the LDS operations performed by different wavefronts of a
5177   work-group can be reordered relative to each other, which can result in
5178   reordering the visibility of vector memory operations with respect to LDS
5179   operations of other wavefronts in the same work-group. A ``s_waitcnt
5180   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5181   vector memory operations between wavefronts of a work-group, but not between
5182   operations performed by the same wavefront.
5183 * The vector memory operations are performed as wavefront-wide operations and
5184   completion is reported to a wavefront in execution order. The exception is
5185   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report
5186   completion out of vector memory order if they access LDS memory, and out of
5187   LDS operation order if they access global memory.
5188 * The vector memory operations access a single vector L1 cache shared by all
5189   SIMDs of a CU. Therefore, no special action is required for coherence between the
5190   lanes of a single wavefront, or for coherence between wavefronts in the same
5191   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
5192   wavefronts executing in different work-groups as they may be executing on
5193   different CUs.
5194 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5195   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5196   scalar operations are used in a restricted way so do not impact the memory
5197   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5198 * The vector and scalar memory operations use an L2 cache shared by all CUs on
5199   the same agent.
5200 * The L2 cache has independent channels to service disjoint ranges of virtual
5201   addresses.
5202 * Each CU has a separate request queue per channel. Therefore, the vector and
5203   scalar memory operations performed by wavefronts executing in different
5204   work-groups (which may be executing on different CUs) of an agent can be
5205   reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5206   ensure synchronization between vector memory operations of different CUs. It
5207   ensures a previous vector memory operation has completed before executing a
5208   subsequent vector memory or LDS operation and so can be used to meet the
5209   requirements of acquire and release.
5210 * The L2 cache can be kept coherent with other agents on some targets, or ranges
5211   of virtual addresses can be set up to bypass it to ensure system coherence.
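
The topology above determines which synchronization instructions a given sync
scope needs. As a rough illustration only, the following Python sketch encodes
that reasoning for an acquire load from the global address space. The function
name ``acquire_global_load_sync`` is hypothetical and is not part of LLVM; it
mirrors the list above rather than the backend's actual implementation.

.. code-block:: python

   def acquire_global_load_sync(scope):
       """Return the instructions issued after an acquire load from global memory.

       Hypothetical helper that follows the GFX6-GFX9 description above.
       """
       if scope in ("singlethread", "wavefront", "workgroup"):
           # A work-group runs on one CU, so all of its wavefronts share the
           # same vector L1 cache and no extra action is needed.
           return []
       if scope in ("agent", "system"):
           # Other work-groups may run on other CUs: wait for the load to
           # complete, then invalidate the volatile lines of the vector L1.
           return ["s_waitcnt vmcnt(0)", "buffer_wbinvl1_vol"]
       raise ValueError(f"unknown sync scope: {scope}")

The result for ``agent`` and ``system`` matches the order given in the code
sequence table: the wait must come before the invalidate so that the load has
completed before the cache is invalidated.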
5212
5213 Scalar memory operations are only used to access memory that is proven to not
5214 change during the execution of the kernel dispatch. This includes constant
5215 address space and global address space for program scope ``const`` variables.
5216 Therefore, the kernel machine code does not have to maintain the scalar cache to
5217 ensure it is coherent with the vector caches. The scalar and vector caches are
5218 invalidated between kernel dispatches by CP since constant address space data
5219 may change between kernel dispatch executions. See
5220 :ref:`amdgpu-amdhsa-memory-spaces`.
5221
5222 The one exception is if scalar writes are used to spill SGPR registers. In this
5223 case the AMDGPU backend ensures the memory location used to spill is never
5224 accessed by vector memory operations at the same time. If scalar writes are used
5225 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5226 return since the locations may be used for vector memory instructions by a
5227 future wavefront that uses the same scratch area, or a function call that
5228 creates a frame at the same address, respectively. There is no need for a
5229 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
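
The write-back rule for scalar spills can be pictured as a simple pass over an
instruction list. The sketch below is hypothetical and only illustrates the
placement rule described above. It assumes that instructions are plain strings
and that ``s_setpc_b64`` stands for a function return, neither of which is
stated by this document.

.. code-block:: python

   def insert_scalar_cache_writebacks(instrs, uses_scalar_writes,
                                      exit_mnemonics=("s_endpgm", "s_setpc_b64")):
       """Insert s_dcache_wb before each wavefront exit or function return.

       Hypothetical sketch: treating s_setpc_b64 as a function return is an
       assumption of this example, not a statement about the backend.
       """
       if not uses_scalar_writes:
           # No scalar writes, so the scalar cache holds no dirty data.
           return list(instrs)
       result = []
       for instr in instrs:
           fields = instr.split()
           if fields and fields[0] in exit_mnemonics:
               # Write back dirty scalar cache lines before leaving.
               result.append("s_dcache_wb")
           result.append(instr)
       return result

No ``s_dcache_inv`` is inserted, which reflects the write-before-read property
noted above.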
5230
5231 For kernarg backing memory:
5232
5233 * CP invalidates the L1 cache at the start of each kernel dispatch.
5234 * On dGPU the kernarg backing memory is allocated in host memory accessed as
5235   MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5236   causes it to be treated as non-volatile and so is not invalidated by
5237   ``*_vol``.
5238 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
5239   and so the L2 cache will be coherent with the CPU and other agents.
5240
5241 Scratch backing memory (which is used for the private address space) is accessed
5242 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5243 only accessed by a single thread, and is always write-before-read, there is
5244 never a need to invalidate these entries from the L1 cache. Hence all cache
5245 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5246
5247 The code sequences used to implement the memory model for GFX6-GFX9 are defined
5248 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
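
As a reading aid, the ``fence acquire`` rows of the table can be summarized by a
small lookup. This is a hypothetical Python sketch, not the backend's code. It
covers only the conservative case in which none of the OpenCL-only omissions
listed in the table apply, and it keeps the table's ``&`` notation for a
combined ``s_waitcnt`` rather than real assembler syntax.

.. code-block:: python

   def acquire_fence_sequence(scope):
       """Return the GFX6-GFX9 code sequence for an acquire fence at ``scope``.

       Hypothetical lookup for the conservative (non-OpenCL) case.
       """
       if scope in ("singlethread", "wavefront"):
           # No code is required at these scopes.
           return []
       if scope == "workgroup":
           # Order against preceding local/generic fence-paired atomics.
           return ["s_waitcnt lgkmcnt(0)"]
       if scope in ("agent", "system"):
           # Wait for the fence-paired atomic, then invalidate the vector L1.
           return ["s_waitcnt lgkmcnt(0) & vmcnt(0)", "buffer_wbinvl1_vol"]
       raise ValueError(f"unknown sync scope: {scope}")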
5249
5250   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5251      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5252
5253      ============ ============ ============== ========== ================================
5254      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
5255                   Ordering     Sync Scope     Address    GFX6-GFX9
5256                                               Space
5257      ============ ============ ============== ========== ================================
5258      **Non-Atomic**
5259      ------------------------------------------------------------------------------------
5260      load         *none*       *none*         - global   - !volatile & !nontemporal
5261                                               - generic
5262                                               - private    1. buffer/global/flat_load
5263                                               - constant
5264                                                          - !volatile & nontemporal
5265
5266                                                            1. buffer/global/flat_load
5267                                                               glc=1 slc=1
5268
5269                                                          - volatile
5270
5271                                                            1. buffer/global/flat_load
5272                                                               glc=1
5273                                                            2. s_waitcnt vmcnt(0)
5274
5275                                                             - Must happen before
5276                                                               any following volatile
5277                                                               global/generic
5278                                                               load/store.
5279                                                             - Ensures that
5280                                                               volatile
5281                                                               operations to
5282                                                               different
5283                                                               addresses will not
5284                                                               be reordered by
5285                                                               hardware.
5286
5287      load         *none*       *none*         - local    1. ds_load
5288      store        *none*       *none*         - global   - !volatile & !nontemporal
5289                                               - generic
5290                                               - private    1. buffer/global/flat_store
5291                                               - constant
5292                                                          - !volatile & nontemporal
5293
5294                                                            1. buffer/global/flat_store
5295                                                               glc=1 slc=1
5296
5297                                                          - volatile
5298
5299                                                            1. buffer/global/flat_store
5300                                                            2. s_waitcnt vmcnt(0)
5301
5302                                                             - Must happen before
5303                                                               any following volatile
5304                                                               global/generic
5305                                                               load/store.
5306                                                             - Ensures that
5307                                                               volatile
5308                                                               operations to
5309                                                               different
5310                                                               addresses will not
5311                                                               be reordered by
5312                                                               hardware.
5313
5314      store        *none*       *none*         - local    1. ds_store
5315      **Unordered Atomic**
5316      ------------------------------------------------------------------------------------
5317      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
5318      store atomic unordered    *any*          *any*      *Same as non-atomic*.
5319      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
5320      **Monotonic Atomic**
5321      ------------------------------------------------------------------------------------
5322      load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
5323                                - wavefront    - local
5324                                - workgroup    - generic
5325      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
5326                                - system       - generic     glc=1
5327      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
5328                                - wavefront    - generic
5329                                - workgroup
5330                                - agent
5331                                - system
5332      store atomic monotonic    - singlethread - local    1. ds_store
5333                                - wavefront
5334                                - workgroup
5335      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
5336                                - wavefront    - generic
5337                                - workgroup
5338                                - agent
5339                                - system
5340      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
5341                                - wavefront
5342                                - workgroup
5343      **Acquire Atomic**
5344      ------------------------------------------------------------------------------------
5345      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
5346                                - wavefront    - local
5347                                               - generic
5348      load atomic  acquire      - workgroup    - global   1. buffer/global_load
5349      load atomic  acquire      - workgroup    - local    1. ds/flat_load
5350                                               - generic  2. s_waitcnt lgkmcnt(0)
5351
5352                                                            - If OpenCL, omit.
5353                                                            - Must happen before
5354                                                              any following
5355                                                              global/generic
5356                                                              load/load
5357                                                              atomic/store/store
5358                                                              atomic/atomicrmw.
5359                                                            - Ensures any
5360                                                              following global
5361                                                              data read is no
5362                                                              older than a local load
5363                                                              atomic value being
5364                                                              acquired.
5365
5366      load atomic  acquire      - agent        - global   1. buffer/global_load
5367                                - system                     glc=1
5368                                                          2. s_waitcnt vmcnt(0)
5369
5370                                                            - Must happen before
5371                                                              following
5372                                                              buffer_wbinvl1_vol.
5373                                                            - Ensures the load
5374                                                              has completed
5375                                                              before invalidating
5376                                                              the cache.
5377
5378                                                          3. buffer_wbinvl1_vol
5379
5380                                                            - Must happen before
5381                                                              any following
5382                                                              global/generic
5383                                                              load/load
5384                                                              atomic/atomicrmw.
5385                                                            - Ensures that
5386                                                              following
5387                                                              loads will not see
5388                                                              stale global data.
5389
5390      load atomic  acquire      - agent        - generic  1. flat_load glc=1
5391                                - system                  2. s_waitcnt vmcnt(0) &
5392                                                            - If OpenCL, omit
5393
5394                                                            - If OpenCL omit
5395                                                              lgkmcnt(0).
5396                                                            - Must happen before
5397                                                              following
5398                                                              buffer_wbinvl1_vol.
5399                                                            - Ensures the flat_load
5400                                                              has completed
5401                                                              before invalidating
5402                                                              the cache.
5403
5404                                                          3. buffer_wbinvl1_vol
5405
5406                                                            - Must happen before
5407                                                              any following
5408                                                              global/generic
5409                                                              load/load
5410                                                              atomic/atomicrmw.
5411                                                            - Ensures that
5412                                                              following loads
5413                                                              will not see stale
5414                                                              global data.
5415
5416      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5417                                - wavefront    - local
5418                                               - generic
5419      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5420      atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5421                                               - generic  2. s_waitcnt lgkmcnt(0)
5422
5423                                                            - If OpenCL, omit.
5424                                                            - Must happen before
5425                                                              any following
5426                                                              global/generic
5427                                                              load/load
5428                                                              atomic/store/store
5429                                                              atomic/atomicrmw.
5430                                                            - Ensures any
5431                                                              following global
5432                                                              data read is no
5433                                                              older than a local
5434                                                              atomicrmw value
5435                                                              being acquired.
5436
5437      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5438                                - system                  2. s_waitcnt vmcnt(0)
5439
5440                                                            - Must happen before
5441                                                              following
5442                                                              buffer_wbinvl1_vol.
5443                                                            - Ensures the
5444                                                              atomicrmw has
5445                                                              completed before
5446                                                              invalidating the
5447                                                              cache.
5448
5449                                                          3. buffer_wbinvl1_vol
5450
5451                                                            - Must happen before
5452                                                              any following
5453                                                              global/generic
5454                                                              load/load
5455                                                              atomic/atomicrmw.
5456                                                            - Ensures that
5457                                                              following loads
5458                                                              will not see stale
5459                                                              global data.
5460
5461      atomicrmw    acquire      - agent        - generic  1. flat_atomic
5462                                - system                  2. s_waitcnt vmcnt(0) &
5463                                                             lgkmcnt(0)
5464
5465                                                            - If OpenCL, omit
5466                                                              lgkmcnt(0).
5467                                                            - Must happen before
5468                                                              following
5469                                                              buffer_wbinvl1_vol.
5470                                                            - Ensures the
5471                                                              atomicrmw has
5472                                                              completed before
5473                                                              invalidating the
5474                                                              cache.
5475
5476                                                          3. buffer_wbinvl1_vol
5477
5478                                                            - Must happen before
5479                                                              any following
5480                                                              global/generic
5481                                                              load/load
5482                                                              atomic/atomicrmw.
5483                                                            - Ensures that
5484                                                              following loads
5485                                                              will not see stale
5486                                                              global data.
5487
5488      fence        acquire      - singlethread *none*     *none*
5489                                - wavefront
5490      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5491
5492                                                            - If OpenCL and
5493                                                              address space is
5494                                                              not generic, omit.
5495                                                            - However, since LLVM
5496                                                              currently has no
5497                                                              address space on
5498                                                              the fence, the
5499                                                              s_waitcnt must
5500                                                              conservatively
5501                                                              always be generated.
5502                                                              If the fence had an
5503                                                              address space, it
5504                                                              would be set to the
5505                                                              address space of the
5506                                                              OpenCL fence flag, or
5507                                                              to generic if both
5508                                                              local and global
5509                                                              flags are specified.
5510                                                            - Must happen after
5511                                                              any preceding
5512                                                              local/generic load
5513                                                              atomic/atomicrmw
5514                                                              with an equal or
5515                                                              wider sync scope
5516                                                              and memory ordering
5517                                                              stronger than
5518                                                              unordered (this is
5519                                                              termed the
5520                                                              fence-paired-atomic).
5521                                                            - Must happen before
5522                                                              any following
5523                                                              global/generic
5524                                                              load/load
5525                                                              atomic/store/store
5526                                                              atomic/atomicrmw.
5527                                                            - Ensures any
5528                                                              following global
5529                                                              data read is no
5530                                                              older than the
5531                                                              value read by the
5532                                                              fence-paired-atomic.
5533
5534      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5535                                - system                     vmcnt(0)
5536
5537                                                            - If OpenCL and
5538                                                              address space is
5539                                                              not generic, omit
5540                                                              lgkmcnt(0).
5541                                                            - However, since LLVM
5542                                                              currently has no
5543                                                              address space on
5544                                                              the fence, lgkmcnt(0)
5545                                                              must conservatively
5546                                                              always be generated
5547                                                              (see comment for
5548                                                              previous fence).
5549                                                            - Could be split into
5550                                                              separate s_waitcnt
5551                                                              vmcnt(0) and
5552                                                              s_waitcnt
5553                                                              lgkmcnt(0) to allow
5554                                                              them to be
5555                                                              independently moved
5556                                                              according to the
5557                                                              following rules.
5558                                                            - s_waitcnt vmcnt(0)
5559                                                              must happen after
5560                                                              any preceding
5561                                                              global/generic load
5562                                                              atomic/atomicrmw
5563                                                              with an equal or
5564                                                              wider sync scope
5565                                                              and memory ordering
5566                                                              stronger than
5567                                                              unordered (this is
5568                                                              termed the
5569                                                              fence-paired-atomic).
5570                                                            - s_waitcnt lgkmcnt(0)
5571                                                              must happen after
5572                                                              any preceding
5573                                                              local/generic load
5574                                                              atomic/atomicrmw
5575                                                              with an equal or
5576                                                              wider sync scope
5577                                                              and memory ordering
5578                                                              stronger than
5579                                                              unordered (this is
5580                                                              termed the
5581                                                              fence-paired-atomic).
5582                                                            - Must happen before
5583                                                              the following
5584                                                              buffer_wbinvl1_vol.
5585                                                            - Ensures that the
5586                                                              fence-paired-atomic
5587                                                              has completed
5588                                                              before invalidating
5589                                                              the
5590                                                              cache. Therefore
5591                                                              any following
5592                                                              locations read must
5593                                                              be no older than
5594                                                              the value read by
5595                                                              the
5596                                                              fence-paired-atomic.
5597
5598                                                          2. buffer_wbinvl1_vol
5599
5600                                                            - Must happen before any
5601                                                              following global/generic
5602                                                              load/load
5603                                                              atomic/store/store
5604                                                              atomic/atomicrmw.
5605                                                            - Ensures that
5606                                                              following loads
5607                                                              will not see stale
5608                                                              global data.
5609
5610      **Release Atomic**
5611      ------------------------------------------------------------------------------------
5612      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5613                                - wavefront    - local
5614                                               - generic
5615      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5616                                               - generic
5617                                                            - If OpenCL, omit.
5618                                                            - Must happen after
5619                                                              any preceding
5620                                                              local/generic
5621                                                              load/store/load
5622                                                              atomic/store
5623                                                              atomic/atomicrmw.
5624                                                            - Must happen before
5625                                                              the following
5626                                                              store.
5627                                                            - Ensures that all
5628                                                              memory operations
5629                                                              to local have
5630                                                              completed before
5631                                                              performing the
5632                                                              store that is being
5633                                                              released.
5634
5635                                                          2. buffer/global/flat_store
5636      store atomic release      - workgroup    - local    1. ds_store
5637      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5638                                - system       - generic     vmcnt(0)
5639
5640                                                            - If OpenCL and
5641                                                              address space is
5642                                                              not generic, omit
5643                                                              lgkmcnt(0).
5644                                                            - Could be split into
5645                                                              separate s_waitcnt
5646                                                              vmcnt(0) and
5647                                                              s_waitcnt
5648                                                              lgkmcnt(0) to allow
5649                                                              them to be
5650                                                              independently moved
5651                                                              according to the
5652                                                              following rules.
5653                                                            - s_waitcnt vmcnt(0)
5654                                                              must happen after
5655                                                              any preceding
5656                                                              global/generic
5657                                                              load/store/load
5658                                                              atomic/store
5659                                                              atomic/atomicrmw.
5660                                                            - s_waitcnt lgkmcnt(0)
5661                                                              must happen after
5662                                                              any preceding
5663                                                              local/generic
5664                                                              load/store/load
5665                                                              atomic/store
5666                                                              atomic/atomicrmw.
5667                                                            - Must happen before
5668                                                              the following
5669                                                              store.
5670                                                            - Ensures that all
5671                                                              memory operations
5672                                                              have
5673                                                              completed before
5674                                                              performing the
5675                                                              store that is being
5676                                                              released.
5677
5678                                                          2. buffer/global/flat_store
5679      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5680                                - wavefront    - local
5681                                               - generic
5682      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5683                                               - generic
5684                                                            - If OpenCL, omit.
5685                                                            - Must happen after
5686                                                              any preceding
5687                                                              local/generic
5688                                                              load/store/load
5689                                                              atomic/store
5690                                                              atomic/atomicrmw.
5691                                                            - Must happen before
5692                                                              the following
5693                                                              atomicrmw.
5694                                                            - Ensures that all
5695                                                              memory operations
5696                                                              to local have
5697                                                              completed before
5698                                                              performing the
5699                                                              atomicrmw that is
5700                                                              being released.
5701
5702                                                          2. buffer/global/flat_atomic
5703      atomicrmw    release      - workgroup    - local    1. ds_atomic
5704      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5705                                - system       - generic     vmcnt(0)
5706
5707                                                            - If OpenCL, omit
5708                                                              lgkmcnt(0).
5709                                                            - Could be split into
5710                                                              separate s_waitcnt
5711                                                              vmcnt(0) and
5712                                                              s_waitcnt
5713                                                              lgkmcnt(0) to allow
5714                                                              them to be
5715                                                              independently moved
5716                                                              according to the
5717                                                              following rules.
5718                                                            - s_waitcnt vmcnt(0)
5719                                                              must happen after
5720                                                              any preceding
5721                                                              global/generic
5722                                                              load/store/load
5723                                                              atomic/store
5724                                                              atomic/atomicrmw.
5725                                                            - s_waitcnt lgkmcnt(0)
5726                                                              must happen after
5727                                                              any preceding
5728                                                              local/generic
5729                                                              load/store/load
5730                                                              atomic/store
5731                                                              atomic/atomicrmw.
5732                                                            - Must happen before
5733                                                              the following
5734                                                              atomicrmw.
5735                                                            - Ensures that all
5736                                                              memory operations
5737                                                              to global and local
5738                                                              have completed
5739                                                              before performing
5740                                                              the atomicrmw that
5741                                                              is being released.
5742
5743                                                          2. buffer/global/flat_atomic
5744      fence        release      - singlethread *none*     *none*
5745                                - wavefront
5746      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5747
5748                                                            - If OpenCL and
5749                                                              address space is
5750                                                              not generic, omit.
5751                                                            - However, since LLVM
5752                                                              currently has no
5753                                                              address space on
5754                                                              the fence, it must
5755                                                              conservatively
5756                                                              always be generated. If
5757                                                              fence had an
5758                                                              address space then
5759                                                              set to address
5760                                                              space of OpenCL
5761                                                              fence flag, or to
5762                                                              generic if both
5763                                                              local and global
5764                                                              flags are
5765                                                              specified.
5766                                                            - Must happen after
5767                                                              any preceding
5768                                                              local/generic
5769                                                              load/load
5770                                                              atomic/store/store
5771                                                              atomic/atomicrmw.
5772                                                            - Must happen before
5773                                                              any following store
5774                                                              atomic/atomicrmw
5775                                                              with an equal or
5776                                                              wider sync scope
5777                                                              and memory ordering
5778                                                              stronger than
5779                                                              unordered (this is
5780                                                              termed the
5781                                                              fence-paired-atomic).
5782                                                            - Ensures that all
5783                                                              memory operations
5784                                                              to local have
5785                                                              completed before
5786                                                              performing the
5787                                                              following
5788                                                              fence-paired-atomic.
5789
5790      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5791                                - system                     vmcnt(0)
5792
5793                                                            - If OpenCL and
5794                                                              address space is
5795                                                              not generic, omit
5796                                                              lgkmcnt(0).
5797                                                            - If OpenCL and
5798                                                              address space is
5799                                                              local, omit
5800                                                              vmcnt(0).
5801                                                            - However, since LLVM
5802                                                              currently has no
5803                                                              address space on
5804                                                              the fence, it must
5805                                                              conservatively
5806                                                              always be generated. If
5807                                                              fence had an
5808                                                              address space then
5809                                                              set to address
5810                                                              space of OpenCL
5811                                                              fence flag, or to
5812                                                              generic if both
5813                                                              local and global
5814                                                              flags are
5815                                                              specified.
5816                                                            - Could be split into
5817                                                              separate s_waitcnt
5818                                                              vmcnt(0) and
5819                                                              s_waitcnt
5820                                                              lgkmcnt(0) to allow
5821                                                              them to be
5822                                                              independently moved
5823                                                              according to the
5824                                                              following rules.
5825                                                            - s_waitcnt vmcnt(0)
5826                                                              must happen after
5827                                                              any preceding
5828                                                              global/generic
5829                                                              load/store/load
5830                                                              atomic/store
5831                                                              atomic/atomicrmw.
5832                                                            - s_waitcnt lgkmcnt(0)
5833                                                              must happen after
5834                                                              any preceding
5835                                                              local/generic
5836                                                              load/store/load
5837                                                              atomic/store
5838                                                              atomic/atomicrmw.
5839                                                            - Must happen before
5840                                                              any following store
5841                                                              atomic/atomicrmw
5842                                                              with an equal or
5843                                                              wider sync scope
5844                                                              and memory ordering
5845                                                              stronger than
5846                                                              unordered (this is
5847                                                              termed the
5848                                                              fence-paired-atomic).
5849                                                            - Ensures that all
5850                                                              memory operations
5851                                                              have
5852                                                              completed before
5853                                                              performing the
5854                                                              following
5855                                                              fence-paired-atomic.
5856
5857      **Acquire-Release Atomic**
5858      ------------------------------------------------------------------------------------
5859      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5860                                - wavefront    - local
5861                                               - generic
5862      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5863
5864                                                            - If OpenCL, omit.
5865                                                            - Must happen after
5866                                                              any preceding
5867                                                              local/generic
5868                                                              load/store/load
5869                                                              atomic/store
5870                                                              atomic/atomicrmw.
5871                                                            - Must happen before
5872                                                              the following
5873                                                              atomicrmw.
5874                                                            - Ensures that all
5875                                                              memory operations
5876                                                              to local have
5877                                                              completed before
5878                                                              performing the
5879                                                              atomicrmw that is
5880                                                              being released.
5881
5882                                                          2. buffer/global_atomic
5883
5884      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5885                                                          2. s_waitcnt lgkmcnt(0)
5886
5887                                                            - If OpenCL, omit.
5888                                                            - Must happen before
5889                                                              any following
5890                                                              global/generic
5891                                                              load/load
5892                                                              atomic/store/store
5893                                                              atomic/atomicrmw.
5894                                                            - Ensures any
5895                                                              following global
5896                                                              data read is no
5897                                                              older than the local
5898                                                              atomicrmw value being
5899                                                              acquired.
5900
5901      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5902
5903                                                            - If OpenCL, omit.
5904                                                            - Must happen after
5905                                                              any preceding
5906                                                              local/generic
5907                                                              load/store/load
5908                                                              atomic/store
5909                                                              atomic/atomicrmw.
5910                                                            - Must happen before
5911                                                              the following
5912                                                              atomicrmw.
5913                                                            - Ensures that all
5914                                                              memory operations
5915                                                              to local have
5916                                                              completed before
5917                                                              performing the
5918                                                              atomicrmw that is
5919                                                              being released.
5920
5921                                                          2. flat_atomic
5922                                                          3. s_waitcnt lgkmcnt(0)
5923
5924                                                            - If OpenCL, omit.
5925                                                            - Must happen before
5926                                                              any following
5927                                                              global/generic
5928                                                              load/load
5929                                                              atomic/store/store
5930                                                              atomic/atomicrmw.
5931                                                            - Ensures any
5932                                                              following global
5933                                                              data read is no
5934                                                              older than a local
5935                                                              atomicrmw value being
5936                                                              acquired.
5937
5938      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5939                                - system                     vmcnt(0)
5940
5941                                                            - If OpenCL, omit
5942                                                              lgkmcnt(0).
5943                                                            - Could be split into
5944                                                              separate s_waitcnt
5945                                                              vmcnt(0) and
5946                                                              s_waitcnt
5947                                                              lgkmcnt(0) to allow
5948                                                              them to be
5949                                                              independently moved
5950                                                              according to the
5951                                                              following rules.
5952                                                            - s_waitcnt vmcnt(0)
5953                                                              must happen after
5954                                                              any preceding
5955                                                              global/generic
5956                                                              load/store/load
5957                                                              atomic/store
5958                                                              atomic/atomicrmw.
5959                                                            - s_waitcnt lgkmcnt(0)
5960                                                              must happen after
5961                                                              any preceding
5962                                                              local/generic
5963                                                              load/store/load
5964                                                              atomic/store
5965                                                              atomic/atomicrmw.
5966                                                            - Must happen before
5967                                                              the following
5968                                                              atomicrmw.
5969                                                            - Ensures that all
5970                                                              memory operations
5971                                                              to global have
5972                                                              completed before
5973                                                              performing the
5974                                                              atomicrmw that is
5975                                                              being released.
5976
5977                                                          2. buffer/global_atomic
5978                                                          3. s_waitcnt vmcnt(0)
5979
5980                                                            - Must happen before
5981                                                              following
5982                                                              buffer_wbinvl1_vol.
5983                                                            - Ensures the
5984                                                              atomicrmw has
5985                                                              completed before
5986                                                              invalidating the
5987                                                              cache.
5988
5989                                                          4. buffer_wbinvl1_vol
5990
5991                                                            - Must happen before
5992                                                              any following
5993                                                              global/generic
5994                                                              load/load
5995                                                              atomic/atomicrmw.
5996                                                            - Ensures that
5997                                                              following loads
5998                                                              will not see stale
5999                                                              global data.
6000
6001      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
6002                                - system                     vmcnt(0)
6003
6004                                                            - If OpenCL, omit
6005                                                              lgkmcnt(0).
6006                                                            - Could be split into
6007                                                              separate s_waitcnt
6008                                                              vmcnt(0) and
6009                                                              s_waitcnt
6010                                                              lgkmcnt(0) to allow
6011                                                              them to be
6012                                                              independently moved
6013                                                              according to the
6014                                                              following rules.
6015                                                            - s_waitcnt vmcnt(0)
6016                                                              must happen after
6017                                                              any preceding
6018                                                              global/generic
6019                                                              load/store/load
6020                                                              atomic/store
6021                                                              atomic/atomicrmw.
6022                                                            - s_waitcnt lgkmcnt(0)
6023                                                              must happen after
6024                                                              any preceding
6025                                                              local/generic
6026                                                              load/store/load
6027                                                              atomic/store
6028                                                              atomic/atomicrmw.
6029                                                            - Must happen before
6030                                                              the following
6031                                                              atomicrmw.
6032                                                            - Ensures that all
6033                                                              memory operations
6034                                                              to global have
6035                                                              completed before
6036                                                              performing the
6037                                                              atomicrmw that is
6038                                                              being released.
6039
6040                                                          2. flat_atomic
6041                                                          3. s_waitcnt vmcnt(0) &
6042                                                             lgkmcnt(0)
6043
6044                                                            - If OpenCL, omit
6045                                                              lgkmcnt(0).
6046                                                            - Must happen before
6047                                                              following
6048                                                              buffer_wbinvl1_vol.
6049                                                            - Ensures the
6050                                                              atomicrmw has
6051                                                              completed before
6052                                                              invalidating the
6053                                                              cache.
6054
6055                                                          4. buffer_wbinvl1_vol
6056
6057                                                            - Must happen before
6058                                                              any following
6059                                                              global/generic
6060                                                              load/load
6061                                                              atomic/atomicrmw.
6062                                                            - Ensures that
6063                                                              following loads
6064                                                              will not see stale
6065                                                              global data.
6066
6067      fence        acq_rel      - singlethread *none*     *none*
6068                                - wavefront
6069      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
6070
6071                                                            - If OpenCL and
6072                                                              address space is
6073                                                              not generic, omit.
6074                                                            - However,
6075                                                              since LLVM
6076                                                              currently has no
6077                                                              address space on
6078                                                              the fence, it must
6079                                                              conservatively
6080                                                              always be generated
6081                                                              (see comment for
6082                                                              previous fence).
6083                                                            - Must happen after
6084                                                              any preceding
6085                                                              local/generic
6086                                                              load/load
6087                                                              atomic/store/store
6088                                                              atomic/atomicrmw.
6089                                                            - Must happen before
6090                                                              any following
6091                                                              global/generic
6092                                                              load/load
6093                                                              atomic/store/store
6094                                                              atomic/atomicrmw.
6095                                                            - Ensures that all
6096                                                              memory operations
6097                                                              to local have
6098                                                              completed before
6099                                                              performing any
6100                                                              following global
6101                                                              memory operations.
6102                                                            - Ensures that the
6103                                                              preceding
6104                                                              local/generic load
6105                                                              atomic/atomicrmw
6106                                                              with an equal or
6107                                                              wider sync scope
6108                                                              and memory ordering
6109                                                              stronger than
6110                                                              unordered (this is
6111                                                              termed the
6112                                                              acquire-fence-paired-atomic)
6113                                                              has completed
6114                                                              before following
6115                                                              global memory
6116                                                              operations. This
6117                                                              satisfies the
6118                                                              requirements of
6119                                                              acquire.
6120                                                            - Ensures that all
6121                                                              previous memory
6122                                                              operations have
6123                                                              completed before a
6124                                                              following
6125                                                              local/generic store
6126                                                              atomic/atomicrmw
6127                                                              with an equal or
6128                                                              wider sync scope
6129                                                              and memory ordering
6130                                                              stronger than
6131                                                              unordered (this is
6132                                                              termed the
6133                                                              release-fence-paired-atomic).
6134                                                              This satisfies the
6135                                                              requirements of
6136                                                              release.
6137
6138      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6139                                - system                     vmcnt(0)
6140
6141                                                            - If OpenCL and
6142                                                              address space is
6143                                                              not generic, omit
6144                                                              lgkmcnt(0).
6145                                                            - However, since LLVM
6146                                                              currently has no
6147                                                              address space on
6148                                                              the fence, it must
6149                                                              conservatively
6150                                                              always be generated
6151                                                              (see comment for
6152                                                              previous fence).
6153                                                            - Could be split into
6154                                                              separate s_waitcnt
6155                                                              vmcnt(0) and
6156                                                              s_waitcnt
6157                                                              lgkmcnt(0) to allow
6158                                                              them to be
6159                                                              independently moved
6160                                                              according to the
6161                                                              following rules.
6162                                                            - s_waitcnt vmcnt(0)
6163                                                              must happen after
6164                                                              any preceding
6165                                                              global/generic
6166                                                              load/store/load
6167                                                              atomic/store
6168                                                              atomic/atomicrmw.
6169                                                            - s_waitcnt lgkmcnt(0)
6170                                                              must happen after
6171                                                              any preceding
6172                                                              local/generic
6173                                                              load/store/load
6174                                                              atomic/store
6175                                                              atomic/atomicrmw.
6176                                                            - Must happen before
6177                                                              the following
6178                                                              buffer_wbinvl1_vol.
6179                                                            - Ensures that the
6180                                                              preceding
6181                                                              global/local/generic
6182                                                              load
6183                                                              atomic/atomicrmw
6184                                                              with an equal or
6185                                                              wider sync scope
6186                                                              and memory ordering
6187                                                              stronger than
6188                                                              unordered (this is
6189                                                              termed the
6190                                                              acquire-fence-paired-atomic)
6191                                                              has completed
6192                                                              before invalidating
6193                                                              the cache. This
6194                                                              satisfies the
6195                                                              requirements of
6196                                                              acquire.
6197                                                            - Ensures that all
6198                                                              previous memory
6199                                                              operations have
6200                                                              completed before a
6201                                                              following
6202                                                              global/local/generic
6203                                                              store
6204                                                              atomic/atomicrmw
6205                                                              with an equal or
6206                                                              wider sync scope
6207                                                              and memory ordering
6208                                                              stronger than
6209                                                              unordered (this is
6210                                                              termed the
6211                                                              release-fence-paired-atomic).
6212                                                              This satisfies the
6213                                                              requirements of
6214                                                              release.
6215
6216                                                          2. buffer_wbinvl1_vol
6217
6218                                                            - Must happen before
6219                                                              any following
6220                                                              global/generic
6221                                                              load/load
6222                                                              atomic/store/store
6223                                                              atomic/atomicrmw.
6224                                                            - Ensures that
6225                                                              following loads
6226                                                              will not see stale
6227                                                              global data. This
6228                                                              satisfies the
6229                                                              requirements of
6230                                                              acquire.
6231
6232      **Sequentially Consistent Atomic**
6233      ------------------------------------------------------------------------------------
6234      load atomic  seq_cst      - singlethread - global   *Same as corresponding
6235                                - wavefront    - local    load atomic acquire,
6236                                               - generic  except must generate
6237                                                          all instructions even
6238                                                          for OpenCL.*
6239      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
6240                                               - generic
6241
6242                                                            - Must
6243                                                              happen after
6244                                                              preceding
6245                                                              local/generic load
6246                                                              atomic/store
6247                                                              atomic/atomicrmw
6248                                                              with memory
6249                                                              ordering of seq_cst
6250                                                              and with equal or
6251                                                              wider sync scope.
6252                                                              (Note that seq_cst
6253                                                              fences have their
6254                                                              own s_waitcnt
6255                                                              lgkmcnt(0) and so do
6256                                                              not need to be
6257                                                              considered.)
6258                                                            - Ensures any
6259                                                              preceding
6260                                                              sequentially
6261                                                              consistent local
6262                                                              memory instructions
6263                                                              have completed
6264                                                              before executing
6265                                                              this sequentially
6266                                                              consistent
6267                                                              instruction. This
6268                                                              prevents reordering
6269                                                              a seq_cst store
6270                                                              followed by a
6271                                                              seq_cst load. (Note
6272                                                              that seq_cst is
6273                                                              stronger than
6274                                                              acquire/release as
6275                                                              the reordering of
6276                                                              load acquire
6277                                                              followed by a store
6278                                                              release is
6279                                                              prevented by the
6280                                                              s_waitcnt of
6281                                                              the release, but
6282                                                              there is nothing
6283                                                              preventing a store
6284                                                              release followed by
6285                                                              load acquire from
6286                                                              completing out of
6287                                                              order. The s_waitcnt
6288                                                              could be placed after
6289                                                              the seq_cst store or
6290                                                              before the seq_cst
6291                                                              load. We choose the
6292                                                              load to make the
6293                                                              s_waitcnt as late as
6294                                                              possible so that the
6295                                                              store may have
6296                                                              already completed.)
6297
6298                                                          2. *Following
6299                                                             instructions same as
6300                                                             corresponding load
6301                                                             atomic acquire,
6302                                                             except must generate
6303                                                             all instructions even
6304                                                             for OpenCL.*
6305      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6306                                                          load atomic acquire,
6307                                                          except must generate
6308                                                          all instructions even
6309                                                          for OpenCL.*
6310
6311      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6312                                - system       - generic     vmcnt(0)
6313
6314                                                            - Could be split into
6315                                                              separate s_waitcnt
6316                                                              vmcnt(0)
6317                                                              and s_waitcnt
6318                                                              lgkmcnt(0) to allow
6319                                                              them to be
6320                                                              independently moved
6321                                                              according to the
6322                                                              following rules.
6323                                                            - s_waitcnt lgkmcnt(0)
6324                                                              must happen after
6325                                                              preceding
6326                                                              local/generic load
6327                                                              atomic/store
6328                                                              atomic/atomicrmw
6329                                                              with memory
6330                                                              ordering of seq_cst
6331                                                              and with equal or
6332                                                              wider sync scope.
6333                                                              (Note that seq_cst
6334                                                              fences have their
6335                                                              own s_waitcnt
6336                                                              lgkmcnt(0) and so do
6337                                                              not need to be
6338                                                              considered.)
6339                                                            - s_waitcnt vmcnt(0)
6340                                                              must happen after
6341                                                              preceding
6342                                                              global/generic load
6343                                                              atomic/store
6344                                                              atomic/atomicrmw
6345                                                              with memory
6346                                                              ordering of seq_cst
6347                                                              and with equal or
6348                                                              wider sync scope.
6349                                                              (Note that seq_cst
6350                                                              fences have their
6351                                                              own s_waitcnt
6352                                                              vmcnt(0) and so do
6353                                                              not need to be
6354                                                              considered.)
6355                                                            - Ensures any
6356                                                              preceding
6357                                                              sequentially
6358                                                              consistent global
6359                                                              memory instructions
6360                                                              have completed
6361                                                              before executing
6362                                                              this sequentially
6363                                                              consistent
6364                                                              instruction. This
6365                                                              prevents reordering
6366                                                              a seq_cst store
6367                                                              followed by a
6368                                                              seq_cst load. (Note
6369                                                              that seq_cst is
6370                                                              stronger than
6371                                                              acquire/release as
6372                                                              the reordering of
6373                                                              load acquire
6374                                                              followed by a store
6375                                                              release is
6376                                                              prevented by the
6377                                                              s_waitcnt of
6378                                                              the release, but
6379                                                              there is nothing
6380                                                              preventing a store
6381                                                              release followed by
6382                                                              load acquire from
6383                                                              completing out of
6384                                                              order. The s_waitcnt
6385                                                              could be placed after
6386                                                              the seq_cst store or
6387                                                              before the seq_cst
6388                                                              load. We choose the
6389                                                              load to make the
6390                                                              s_waitcnt as late as
6391                                                              possible so that the
6392                                                              store may have
6393                                                              already completed.)
6394
6395                                                          2. *Following
6396                                                             instructions same as
6397                                                             corresponding load
6398                                                             atomic acquire,
6399                                                             except must generate
6400                                                             all instructions even
6401                                                             for OpenCL.*
6402      store atomic seq_cst      - singlethread - global   *Same as corresponding
6403                                - wavefront    - local    store atomic release,
6404                                - workgroup    - generic  except must generate
6405                                - agent                   all instructions even
6406                                - system                  for OpenCL.*
6407      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6408                                - wavefront    - local    atomicrmw acq_rel,
6409                                - workgroup    - generic  except must generate
6410                                - agent                   all instructions even
6411                                - system                  for OpenCL.*
6412      fence        seq_cst      - singlethread *none*     *Same as corresponding
6413                                - wavefront               fence acq_rel,
6414                                - workgroup               except must generate
6415                                - agent                   all instructions even
6416                                - system                  for OpenCL.*
6417      ============ ============ ============== ========== ================================
6418
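The ``fence acq_rel`` and ``fence seq_cst`` rows above can be summarized as a
small lookup. The following Python sketch is illustrative only: the function
names are hypothetical and are not part of LLVM, and the returned strings use
the table's notation rather than exact assembler syntax. It assumes the
conservative case described in the table, in which the waits are always
generated because the fence carries no address space.

```python
from typing import List


def fence_acq_rel_sequence(sync_scope: str) -> List[str]:
    """Return the instructions the table lists for ``fence acq_rel``.

    Hypothetical helper for illustration; strings follow the table's
    notation (for example ``lgkmcnt(0) & vmcnt(0)``).
    """
    if sync_scope in ("singlethread", "wavefront"):
        # The table lists *none* for these scopes.
        return []
    if sync_scope == "workgroup":
        # Wait for preceding local/generic operations to complete.
        return ["s_waitcnt lgkmcnt(0)"]
    if sync_scope in ("agent", "system"):
        # Wait for local and global operations, then invalidate the L1
        # cache so that following loads do not see stale global data.
        return ["s_waitcnt lgkmcnt(0) & vmcnt(0)", "buffer_wbinvl1_vol"]
    raise ValueError("unknown sync scope: " + sync_scope)


def fence_seq_cst_sequence(sync_scope: str) -> List[str]:
    """``fence seq_cst`` is listed as the same as ``fence acq_rel``,
    with all instructions generated even for OpenCL."""
    return fence_acq_rel_sequence(sync_scope)


if __name__ == "__main__":
    for scope in ("wavefront", "workgroup", "agent"):
        print(scope, fence_acq_rel_sequence(scope))
```

The sketch does not model the OpenCL omissions or the option of splitting the
combined ``s_waitcnt`` into separate waits; both are described in the table
rows themselves.
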
6419 .. _amdgpu-amdhsa-memory-model-gfx90a:
6420
6421 Memory Model GFX90A
6422 +++++++++++++++++++
6423
6424 For GFX90A:
6425
6426 * Each agent has multiple shader arrays (SA).
6427 * Each SA has multiple compute units (CU).
6428 * Each CU has multiple SIMDs that execute wavefronts.
6429 * The wavefronts for a single work-group are executed in the same CU but may be
6430   executed by different SIMDs. The exception is tgsplit execution mode, in
6431   which the wavefronts may be executed by different SIMDs in different CUs.
6432 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6433   executing on it. The exception is tgsplit execution mode, in which no LDS is
6434   allocated because wavefronts of the same work-group can be in different CUs.
6435 * All LDS operations of a CU are performed as wavefront wide operations in a
6436   global order and involve no caching. Completion is reported to a wavefront in
6437   execution order.
6438 * The LDS memory has multiple request queues shared by the SIMDs of a
6439   CU. Therefore, the LDS operations performed by different wavefronts of a
6440   work-group can be reordered relative to each other, which can result in
6441   reordering the visibility of vector memory operations with respect to LDS
6442   operations of other wavefronts in the same work-group. A ``s_waitcnt
6443   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6444   vector memory operations of different wavefronts of a work-group, but not
6445   between operations performed by the same wavefront.
6446 * The vector memory operations are performed as wavefront wide operations and
6447   completion is reported to a wavefront in execution order. The exception is
6448   that ``flat_load/store/atomic`` instructions can report out of vector memory
6449   order if they access LDS memory, and out of LDS operation order if they access
6450   global memory.
6451 * The vector memory operations access a single vector L1 cache shared by all
6452   SIMDs of a CU. Therefore:
6453
6454   * No special action is required for coherence between the lanes of a single
6455     wavefront.
6456
6457   * No special action is required for coherence between wavefronts in the same
6458     work-group since they execute on the same CU. The exception is tgsplit
6459     execution mode: wavefronts of the same work-group can be in different CUs,
6460     so a ``buffer_wbinvl1_vol`` is required as described in the following
6461     item.
6462
6463   * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6464     executing in different work-groups as they may be executing on different
6465     CUs.
6466
6467 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6468   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6469   scalar operations are used in a restricted way so do not impact the memory
6470   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6471 * The vector and scalar memory operations use an L2 cache shared by all CUs on
6472   the same agent.
6473
6474   * The L2 cache has independent channels to service disjoint ranges of virtual
6475     addresses.
6476   * Each CU has a separate request queue per channel. Therefore, the vector and
6477     scalar memory operations performed by wavefronts executing in different
6478     work-groups (which may be executing on different CUs), or the same
6479     work-group if executing in tgsplit mode, of an agent can be reordered
6480     relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6481     synchronization between vector memory operations of different CUs. It
6482     ensures a previous vector memory operation has completed before executing a
6483     subsequent vector memory or LDS operation and so can be used to meet the
6484     requirements of acquire and release.
6485   * The L2 cache of one agent can be kept coherent with other agents by:
6486     using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6487     C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6488     the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6489
6490     * Any local memory cache lines will be automatically invalidated by writes
6491       from CUs associated with other L2 caches, or writes from the CPU, due to
6492       the cache probe caused by coherent requests. Coherent requests are caused
6493       by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6494       XGMI, and by PCIe requests that are configured to be coherent requests.
6495     * XGMI accesses from the CPU to local memory may be cached on the CPU.
6496       Subsequent access from the GPU will automatically invalidate or writeback
6497       the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6498     * Since all work-groups on the same agent share the same L2, no L2
6499       invalidation or writeback is required for coherence.
6500     * To ensure coherence of local and remote memory writes of work-groups in
6501       different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6502       cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6503       (used for remote coarse grain memory). Note that MTYPE CC (used for local
6504       fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6505       remote fine grain memory) bypasses the L2, so both will never result in
6506       dirty L2 cache lines.
6507     * To ensure coherence of local and remote memory reads of work-groups in
6508       different agents a ``buffer_invl2`` is required. It will invalidate L2
6509       cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6510       MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6511       coarse grain memory) cause local reads to be invalidated by remote writes
6512       with the PTE C-bit so these cache lines are not invalidated. Note that
6513       MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6514       never result in L2 cache lines that need to be invalidated.
6515
6516   * PCIe access from the GPU to the CPU memory is kept coherent by using the
6517     MTYPE UC (uncached) which bypasses the L2.
6518
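As a rough illustration of how the rules above combine, the following sketch
shows one possible GFX90A sequence for a system-scope acquire load from the
global address space. The register numbers (``v[0:1]`` for the address and
``v2`` for the result) are arbitrary choices made for this example, and the
sequence only mirrors the ordering requirements described in this section. It
is not meant to reproduce exact compiler output.

.. code-block:: nasm

  ; Hypothetical registers: v[0:1] holds the global address, v2 the result.
  global_load_dword v2, v[0:1], off glc  ; acquire load, issued with glc=1
  s_waitcnt vmcnt(0)                     ; load must complete before the invalidates
  buffer_invl2                           ; invalidate L2 lines of MTYPE NC memory
  buffer_wbinvl1_vol                     ; invalidate volatile lines in the vector L1

For an agent-scope acquire the ``buffer_invl2`` would not be needed, since all
work-groups on the same agent share the same L2.
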
6519 Scalar memory operations are only used to access memory that is proven to not
6520 change during the execution of the kernel dispatch. This includes constant
6521 address space and global address space for program scope ``const`` variables.
6522 Therefore, the kernel machine code does not have to maintain the scalar cache to
6523 ensure it is coherent with the vector caches. The scalar and vector caches are
6524 invalidated between kernel dispatches by CP since constant address space data
6525 may change between kernel dispatch executions. See
6526 :ref:`amdgpu-amdhsa-memory-spaces`.
6527
6528 The one exception is if scalar writes are used to spill SGPR registers. In this
6529 case the AMDGPU backend ensures the memory location used to spill is never
6530 accessed by vector memory operations at the same time. If scalar writes are used
6531 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6532 return since the locations may be used for vector memory instructions by a
6533 future wavefront that uses the same scratch area, or a function call that
6534 creates a frame at the same address, respectively. There is no need for a
6535 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6536
6537 For kernarg backing memory:
6538
6539 * CP invalidates the L1 cache at the start of each kernel dispatch.
6540 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6541   memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6542   cache. This also causes it to be treated as non-volatile and so is not
6543   invalidated by ``*_vol``.
6544 * On APU the kernarg backing memory is accessed as MTYPE CC (cache-coherent) and
6545   so the L2 cache will be coherent with the CPU and other agents.
6546
6547 Scratch backing memory (which is used for the private address space) is accessed
6548 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6549 only accessed by a single thread, and is always write-before-read, there is
6550 never a need to invalidate these entries from the L1 cache. Hence all cache
6551 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6552
6553 The code sequences used to implement the memory model for GFX90A are defined
6554 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6555
6556   .. table:: AMDHSA Memory Model Code Sequences GFX90A
6557      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6558
6559      ============ ============ ============== ========== ================================
6560      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6561                   Ordering     Sync Scope     Address    GFX90A
6562                                               Space
6563      ============ ============ ============== ========== ================================
6564      **Non-Atomic**
6565      ------------------------------------------------------------------------------------
6566      load         *none*       *none*         - global   - !volatile & !nontemporal
6567                                               - generic
6568                                               - private    1. buffer/global/flat_load
6569                                               - constant
6570                                                          - !volatile & nontemporal
6571
6572                                                            1. buffer/global/flat_load
6573                                                               glc=1 slc=1
6574
6575                                                          - volatile
6576
6577                                                            1. buffer/global/flat_load
6578                                                               glc=1
6579                                                            2. s_waitcnt vmcnt(0)
6580
6581                                                             - Must happen before
6582                                                               any following volatile
6583                                                               global/generic
6584                                                               load/store.
6585                                                             - Ensures that
6586                                                               volatile
6587                                                               operations to
6588                                                               different
6589                                                               addresses will not
6590                                                               be reordered by
6591                                                               hardware.
6592
6593      load         *none*       *none*         - local    1. ds_load
6594      store        *none*       *none*         - global   - !volatile & !nontemporal
6595                                               - generic
6596                                               - private    1. buffer/global/flat_store
6597                                               - constant
6598                                                          - !volatile & nontemporal
6599
6600                                                            1. buffer/global/flat_store
6601                                                               glc=1 slc=1
6602
6603                                                          - volatile
6604
6605                                                            1. buffer/global/flat_store
6606                                                            2. s_waitcnt vmcnt(0)
6607
6608                                                             - Must happen before
6609                                                               any following volatile
6610                                                               global/generic
6611                                                               load/store.
6612                                                             - Ensures that
6613                                                               volatile
6614                                                               operations to
6615                                                               different
6616                                                               addresses will not
6617                                                               be reordered by
6618                                                               hardware.
6619
6620      store        *none*       *none*         - local    1. ds_store
6621      **Unordered Atomic**
6622      ------------------------------------------------------------------------------------
6623      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6624      store atomic unordered    *any*          *any*      *Same as non-atomic*.
6625      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6626      **Monotonic Atomic**
6627      ------------------------------------------------------------------------------------
6628      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6629                                - wavefront    - generic
6630      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6631                                               - generic     glc=1
6632
6633                                                            - If not TgSplit execution
6634                                                              mode, omit glc=1.
6635
6636      load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6637                                - wavefront               local address space cannot
6638                                - workgroup               be used.*
6639
6640                                                          1. ds_load
6641      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6642                                               - generic     glc=1
6643      load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6644                                               - generic     glc=1
6645      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6646                                - wavefront    - generic
6647                                - workgroup
6648                                - agent
6649      store atomic monotonic    - system       - global   1. buffer/global/flat_store
6650                                               - generic
6651      store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6652                                - wavefront               local address space cannot
6653                                - workgroup               be used.*
6654
6655                                                          1. ds_store
6656      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6657                                - wavefront    - generic
6658                                - workgroup
6659                                - agent
6660      atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6661                                               - generic
6662      atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6663                                - wavefront               local address space cannot
6664                                - workgroup               be used.*
6665
6666                                                          1. ds_atomic
6667      **Acquire Atomic**
6668      ------------------------------------------------------------------------------------
6669      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6670                                - wavefront    - local
6671                                               - generic
6672      load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6673
6674                                                            - If not TgSplit execution
6675                                                              mode, omit glc=1.
6676
6677                                                          2. s_waitcnt vmcnt(0)
6678
6679                                                            - If not TgSplit execution
6680                                                              mode, omit.
6681                                                            - Must happen before the
6682                                                              following buffer_wbinvl1_vol.
6683
6684                                                          3. buffer_wbinvl1_vol
6685
6686                                                            - If not TgSplit execution
6687                                                              mode, omit.
6688                                                            - Must happen before
6689                                                              any following
6690                                                              global/generic
6691                                                              load/load
6692                                                              atomic/store/store
6693                                                              atomic/atomicrmw.
6694                                                            - Ensures that
6695                                                              following
6696                                                              loads will not see
6697                                                              stale data.
6698
6699      load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6700                                                          local address space cannot
6701                                                          be used.*
6702
6703                                                          1. ds_load
6704                                                          2. s_waitcnt lgkmcnt(0)
6705
6706                                                            - If OpenCL, omit.
6707                                                            - Must happen before
6708                                                              any following
6709                                                              global/generic
6710                                                              load/load
6711                                                              atomic/store/store
6712                                                              atomic/atomicrmw.
6713                                                            - Ensures any
6714                                                              following global
6715                                                              data read is no
6716                                                              older than the local load
6717                                                              atomic value being
6718                                                              acquired.
6719
6720      load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6721
6722                                                            - If not TgSplit execution
6723                                                              mode, omit glc=1.
6724
6725                                                          2. s_waitcnt lgkm/vmcnt(0)
6726
6727                                                            - Use lgkmcnt(0) if not
6728                                                              TgSplit execution mode
6729                                                              and vmcnt(0) if TgSplit
6730                                                              execution mode.
6731                                                            - If OpenCL, omit lgkmcnt(0).
6732                                                            - Must happen before
6733                                                              the following
6734                                                              buffer_wbinvl1_vol and any
6735                                                              following global/generic
6736                                                              load/load
6737                                                              atomic/store/store
6738                                                              atomic/atomicrmw.
6739                                                            - Ensures any
6740                                                              following global
6741                                                              data read is no
6742                                                              older than a local load
6743                                                              atomic value being
6744                                                              acquired.
6745
6746                                                          3. buffer_wbinvl1_vol
6747
6748                                                            - If not TgSplit execution
6749                                                              mode, omit.
6750                                                            - Ensures that
6751                                                              following
6752                                                              loads will not see
6753                                                              stale data.
6754
6755      load atomic  acquire      - agent        - global   1. buffer/global_load
6756                                                             glc=1
6757                                                          2. s_waitcnt vmcnt(0)
6758
6759                                                            - Must happen before
6760                                                              following
6761                                                              buffer_wbinvl1_vol.
6762                                                            - Ensures the load
6763                                                              has completed
6764                                                              before invalidating
6765                                                              the cache.
6766
6767                                                          3. buffer_wbinvl1_vol
6768
6769                                                            - Must happen before
6770                                                              any following
6771                                                              global/generic
6772                                                              load/load
6773                                                              atomic/atomicrmw.
6774                                                            - Ensures that
6775                                                              following
6776                                                              loads will not see
6777                                                              stale global data.
6778
6779      load atomic  acquire      - system       - global   1. buffer/global/flat_load
6780                                                             glc=1
6781                                                          2. s_waitcnt vmcnt(0)
6782
6783                                                            - Must happen before
6784                                                              following buffer_invl2 and
6785                                                              buffer_wbinvl1_vol.
6786                                                            - Ensures the load
6787                                                              has completed
6788                                                              before invalidating
6789                                                              the cache.
6790
6791                                                          3. buffer_invl2;
6792                                                             buffer_wbinvl1_vol
6793
6794                                                            - Must happen before
6795                                                              any following
6796                                                              global/generic
6797                                                              load/load
6798                                                              atomic/atomicrmw.
6799                                                            - Ensures that
6800                                                              following
6801                                                              loads will not see
6802                                                              stale L1 global data,
6803                                                              nor see stale L2 MTYPE
6804                                                              NC global data.
6805                                                              MTYPE RW and CC memory will
6806                                                              never be stale in L2 due to
6807                                                              the memory probes.
6808
6809      load atomic  acquire      - agent        - generic  1. flat_load glc=1
6810                                                          2. s_waitcnt vmcnt(0) &
6811                                                             lgkmcnt(0)
6812
6813                                                            - If TgSplit execution mode,
6814                                                              omit lgkmcnt(0).
6815                                                            - If OpenCL, omit
6816                                                              lgkmcnt(0).
6817                                                            - Must happen before
6818                                                              following
6819                                                              buffer_wbinvl1_vol.
6820                                                            - Ensures the flat_load
6821                                                              has completed
6822                                                              before invalidating
6823                                                              the cache.
6824
6825                                                          3. buffer_wbinvl1_vol
6826
6827                                                            - Must happen before
6828                                                              any following
6829                                                              global/generic
6830                                                              load/load
6831                                                              atomic/atomicrmw.
6832                                                            - Ensures that
6833                                                              following loads
6834                                                              will not see stale
6835                                                              global data.
6836
6837      load atomic  acquire      - system       - generic  1. flat_load glc=1
6838                                                          2. s_waitcnt vmcnt(0) &
6839                                                             lgkmcnt(0)
6840
6841                                                            - If TgSplit execution mode,
6842                                                              omit lgkmcnt(0).
6843                                                            - If OpenCL, omit
6844                                                              lgkmcnt(0).
6845                                                            - Must happen before
6846                                                              following
6847                                                              buffer_invl2 and
6848                                                              buffer_wbinvl1_vol.
6849                                                            - Ensures the flat_load
6850                                                              has completed
6851                                                              before invalidating
6852                                                              the caches.
6853
6854                                                          3. buffer_invl2;
6855                                                             buffer_wbinvl1_vol
6856
6857                                                            - Must happen before
6858                                                              any following
6859                                                              global/generic
6860                                                              load/load
6861                                                              atomic/atomicrmw.
6862                                                            - Ensures that
6863                                                              following
6864                                                              loads will not see
6865                                                              stale L1 global data,
6866                                                              nor see stale L2 MTYPE
6867                                                              NC global data.
6868                                                              MTYPE RW and CC memory will
6869                                                              never be stale in L2 due to
6870                                                              the memory probes.
6871
6872      atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6873                                - wavefront    - generic
6874      atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6875                                - wavefront               local address space cannot
6876                                                          be used.*
6877
6878                                                          1. ds_atomic
6879      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6880                                                          2. s_waitcnt vmcnt(0)
6881
6882                                                            - If not TgSplit execution
6883                                                              mode, omit.
6884                                                            - Must happen before the
6885                                                              following buffer_wbinvl1_vol.
6886                                                            - Ensures the atomicrmw
6887                                                              has completed
6888                                                              before invalidating
6889                                                              the cache.
6890
6891                                                          3. buffer_wbinvl1_vol
6892
6893                                                            - If not TgSplit execution
6894                                                              mode, omit.
6895                                                            - Must happen before
6896                                                              any following
6897                                                              global/generic
6898                                                              load/load
6899                                                              atomic/atomicrmw.
6900                                                            - Ensures that
6901                                                              following loads
6902                                                              will not see stale
6903                                                              global data.
6904
6905      atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6906                                                          local address space cannot
6907                                                          be used.*
6908
6909                                                          1. ds_atomic
6910                                                          2. s_waitcnt lgkmcnt(0)
6911
6912                                                            - If OpenCL, omit.
6913                                                            - Must happen before
6914                                                              any following
6915                                                              global/generic
6916                                                              load/load
6917                                                              atomic/store/store
6918                                                              atomic/atomicrmw.
6919                                                            - Ensures any
6920                                                              following global
6921                                                              data read is no
6922                                                              older than the local
6923                                                              atomicrmw value
6924                                                              being acquired.
6925
6926      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6927                                                          2. s_waitcnt lgkm/vmcnt(0)
6928
6929                                                            - Use lgkmcnt(0) if not
6930                                                              TgSplit execution mode
6931                                                              and vmcnt(0) if TgSplit
6932                                                              execution mode.
6933                                                            - If OpenCL, omit lgkmcnt(0).
6934                                                            - Must happen before
6935                                                              the following
6936                                                              buffer_wbinvl1_vol and
6937                                                              any following
6938                                                              global/generic
6939                                                              load/load
6940                                                              atomic/store/store
6941                                                              atomic/atomicrmw.
6942                                                            - Ensures any
6943                                                              following global
6944                                                              data read is no
6945                                                              older than a local
6946                                                              atomicrmw value
6947                                                              being acquired.
6948
6949                                                          3. buffer_wbinvl1_vol
6950
6951                                                            - If not TgSplit execution
6952                                                              mode, omit.
6953                                                            - Ensures that
6954                                                              following
6955                                                              loads will not see
6956                                                              stale data.
6957
6958      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6959                                                          2. s_waitcnt vmcnt(0)
6960
6961                                                            - Must happen before
6962                                                              following
6963                                                              buffer_wbinvl1_vol.
6964                                                            - Ensures the
6965                                                              atomicrmw has
6966                                                              completed before
6967                                                              invalidating the
6968                                                              cache.
6969
6970                                                          3. buffer_wbinvl1_vol
6971
6972                                                            - Must happen before
6973                                                              any following
6974                                                              global/generic
6975                                                              load/load
6976                                                              atomic/atomicrmw.
6977                                                            - Ensures that
6978                                                              following loads
6979                                                              will not see stale
6980                                                              global data.
6981
6982      atomicrmw    acquire      - system       - global   1. buffer/global_atomic
6983                                                          2. s_waitcnt vmcnt(0)
6984
6985                                                            - Must happen before
6986                                                              following buffer_invl2 and
6987                                                              buffer_wbinvl1_vol.
6988                                                            - Ensures the
6989                                                              atomicrmw has
6990                                                              completed before
6991                                                              invalidating the
6992                                                              caches.
6993
6994                                                          3. buffer_invl2;
6995                                                             buffer_wbinvl1_vol
6996
6997                                                            - Must happen before
6998                                                              any following
6999                                                              global/generic
7000                                                              load/load
7001                                                              atomic/atomicrmw.
7002                                                            - Ensures that
7003                                                              following
7004                                                              loads will not see
7005                                                              stale L1 global data,
7006                                                              nor see stale L2 MTYPE
7007                                                              NC global data.
7008                                                              MTYPE RW and CC memory will
7009                                                              never be stale in L2 due to
7010                                                              the memory probes.
7011
7012      atomicrmw    acquire      - agent        - generic  1. flat_atomic
7013                                                          2. s_waitcnt vmcnt(0) &
7014                                                             lgkmcnt(0)
7015
7016                                                            - If TgSplit execution mode,
7017                                                              omit lgkmcnt(0).
7018                                                            - If OpenCL, omit
7019                                                              lgkmcnt(0).
7020                                                            - Must happen before
7021                                                              following
7022                                                              buffer_wbinvl1_vol.
7023                                                            - Ensures the
7024                                                              atomicrmw has
7025                                                              completed before
7026                                                              invalidating the
7027                                                              cache.
7028
7029                                                          3. buffer_wbinvl1_vol
7030
7031                                                            - Must happen before
7032                                                              any following
7033                                                              global/generic
7034                                                              load/load
7035                                                              atomic/atomicrmw.
7036                                                            - Ensures that
7037                                                              following loads
7038                                                              will not see stale
7039                                                              global data.
7040
7041      atomicrmw    acquire      - system       - generic  1. flat_atomic
7042                                                          2. s_waitcnt vmcnt(0) &
7043                                                             lgkmcnt(0)
7044
7045                                                            - If TgSplit execution mode,
7046                                                              omit lgkmcnt(0).
7047                                                            - If OpenCL, omit
7048                                                              lgkmcnt(0).
7049                                                            - Must happen before
7050                                                              following
7051                                                              buffer_invl2 and
7052                                                              buffer_wbinvl1_vol.
7053                                                            - Ensures the
7054                                                              atomicrmw has
7055                                                              completed before
7056                                                              invalidating the
7057                                                              caches.
7058
7059                                                          3. buffer_invl2;
7060                                                             buffer_wbinvl1_vol
7061
7062                                                            - Must happen before
7063                                                              any following
7064                                                              global/generic
7065                                                              load/load
7066                                                              atomic/atomicrmw.
7067                                                            - Ensures that
7068                                                              following
7069                                                              loads will not see
7070                                                              stale L1 global data,
7071                                                              nor see stale L2 MTYPE
7072                                                              NC global data.
7073                                                              MTYPE RW and CC memory will
7074                                                              never be stale in L2 due to
7075                                                              the memory probes.
7076
7077      fence        acquire      - singlethread *none*     *none*
7078                                - wavefront
7079      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7080
7081                                                            - Use lgkmcnt(0) if not
7082                                                              TgSplit execution mode
7083                                                              and vmcnt(0) if TgSplit
7084                                                              execution mode.
7085                                                            - If OpenCL and
7086                                                              address space is
7087                                                              not generic, omit
7088                                                              lgkmcnt(0).
7089                                                            - If OpenCL and
7090                                                              address space is
7091                                                              local, omit
7092                                                              vmcnt(0).
7093                                                            - However, since LLVM
7094                                                              currently has no
7095                                                              address space on
7096                                                              the fence, it must
7097                                                              conservatively
7098                                                              always be generated. If
7099                                                              the fence had an
7100                                                              address space, it would be
7101                                                              set to the address
7102                                                              space of the OpenCL
7103                                                              fence flag, or to
7104                                                              generic if both
7105                                                              local and global
7106                                                              flags are
7107                                                              specified.
7108                                                            - s_waitcnt vmcnt(0)
7109                                                              must happen after
7110                                                              any preceding
7111                                                              global/generic load
7112                                                              atomic/
7113                                                              atomicrmw
7114                                                              with an equal or
7115                                                              wider sync scope
7116                                                              and memory ordering
7117                                                              stronger than
7118                                                              unordered (this is
7119                                                              termed the
7120                                                              fence-paired-atomic).
7121                                                            - s_waitcnt lgkmcnt(0)
7122                                                              must happen after
7123                                                              any preceding
7124                                                              local/generic load
7125                                                              atomic/atomicrmw
7126                                                              with an equal or
7127                                                              wider sync scope
7128                                                              and memory ordering
7129                                                              stronger than
7130                                                              unordered (this is
7131                                                              termed the
7132                                                              fence-paired-atomic).
7133                                                            - Must happen before
7134                                                              the following
7135                                                              buffer_wbinvl1_vol and
7136                                                              any following
7137                                                              global/generic
7138                                                              load/load
7139                                                              atomic/store/store
7140                                                              atomic/atomicrmw.
7141                                                            - Ensures any
7142                                                              following global
7143                                                              data read is no
7144                                                              older than the
7145                                                              value read by the
7146                                                              fence-paired-atomic.
7147
7148                                                          2. buffer_wbinvl1_vol
7149
7150                                                            - If not TgSplit execution
7151                                                              mode, omit.
7152                                                            - Ensures that
7153                                                              following
7154                                                              loads will not see
7155                                                              stale data.
7156
7157      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7158                                                             vmcnt(0)
7159
7160                                                            - If TgSplit execution mode,
7161                                                              omit lgkmcnt(0).
7162                                                            - If OpenCL and
7163                                                              address space is
7164                                                              not generic, omit
7165                                                              lgkmcnt(0).
7166                                                            - However, since LLVM
7167                                                              currently has no
7168                                                              address space on
7169                                                              the fence, it must
7170                                                              conservatively
7171                                                              always be generated
7172                                                              (see comment for
7173                                                              previous fence).
7174                                                            - Could be split into
7175                                                              separate s_waitcnt
7176                                                              vmcnt(0) and
7177                                                              s_waitcnt
7178                                                              lgkmcnt(0) to allow
7179                                                              them to be
7180                                                              independently moved
7181                                                              according to the
7182                                                              following rules.
7183                                                            - s_waitcnt vmcnt(0)
7184                                                              must happen after
7185                                                              any preceding
7186                                                              global/generic load
7187                                                              atomic/atomicrmw
7188                                                              with an equal or
7189                                                              wider sync scope
7190                                                              and memory ordering
7191                                                              stronger than
7192                                                              unordered (this is
7193                                                              termed the
7194                                                              fence-paired-atomic).
7195                                                            - s_waitcnt lgkmcnt(0)
7196                                                              must happen after
7197                                                              any preceding
7198                                                              local/generic load
7199                                                              atomic/atomicrmw
7200                                                              with an equal or
7201                                                              wider sync scope
7202                                                              and memory ordering
7203                                                              stronger than
7204                                                              unordered (this is
7205                                                              termed the
7206                                                              fence-paired-atomic).
7207                                                            - Must happen before
7208                                                              the following
7209                                                              buffer_wbinvl1_vol.
7210                                                            - Ensures that the
7211                                                              fence-paired-atomic
7212                                                              has completed
7213                                                              before invalidating
7214                                                              the
7215                                                              cache. Therefore
7216                                                              any following
7217                                                              locations read must
7218                                                              be no older than
7219                                                              the value read by
7220                                                              the
7221                                                              fence-paired-atomic.
7222
7223                                                          2. buffer_wbinvl1_vol
7224
7225                                                            - Must happen before any
7226                                                              following global/generic
7227                                                              load/load
7228                                                              atomic/store/store
7229                                                              atomic/atomicrmw.
7230                                                            - Ensures that
7231                                                              following loads
7232                                                              will not see stale
7233                                                              global data.
7234
7235      fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
7236                                                             vmcnt(0)
7237
7238                                                            - If TgSplit execution mode,
7239                                                              omit lgkmcnt(0).
7240                                                            - If OpenCL and
7241                                                              address space is
7242                                                              not generic, omit
7243                                                              lgkmcnt(0).
7244                                                            - However, since LLVM
7245                                                              currently has no
7246                                                              address space on
7247                                                              the fence, it must
7248                                                              conservatively
7249                                                              always be generated
7250                                                              (see comment for
7251                                                              previous fence).
7252                                                            - Could be split into
7253                                                              separate s_waitcnt
7254                                                              vmcnt(0) and
7255                                                              s_waitcnt
7256                                                              lgkmcnt(0) to allow
7257                                                              them to be
7258                                                              independently moved
7259                                                              according to the
7260                                                              following rules.
7261                                                            - s_waitcnt vmcnt(0)
7262                                                              must happen after
7263                                                              any preceding
7264                                                              global/generic load
7265                                                              atomic/atomicrmw
7266                                                              with an equal or
7267                                                              wider sync scope
7268                                                              and memory ordering
7269                                                              stronger than
7270                                                              unordered (this is
7271                                                              termed the
7272                                                              fence-paired-atomic).
7273                                                            - s_waitcnt lgkmcnt(0)
7274                                                              must happen after
7275                                                              any preceding
7276                                                              local/generic load
7277                                                              atomic/atomicrmw
7278                                                              with an equal or
7279                                                              wider sync scope
7280                                                              and memory ordering
7281                                                              stronger than
7282                                                              unordered (this is
7283                                                              termed the
7284                                                              fence-paired-atomic).
7285                                                            - Must happen before
7286                                                              the following buffer_invl2 and
7287                                                              buffer_wbinvl1_vol.
7288                                                            - Ensures that the
7289                                                              fence-paired-atomic
7290                                                              has completed
7291                                                              before invalidating
7292                                                              the
7293                                                              cache. Therefore
7294                                                              any following
7295                                                              locations read must
7296                                                              be no older than
7297                                                              the value read by
7298                                                              the
7299                                                              fence-paired-atomic.
7300
7301                                                          2. buffer_invl2;
7302                                                             buffer_wbinvl1_vol
7303
7304                                                            - Must happen before any
7305                                                              following global/generic
7306                                                              load/load
7307                                                              atomic/store/store
7308                                                              atomic/atomicrmw.
7309                                                            - Ensures that
7310                                                              following
7311                                                              loads will not see
7312                                                              stale L1 global data,
7313                                                              nor see stale L2 MTYPE
7314                                                              NC global data.
7315                                                              MTYPE RW and CC memory will
7316                                                              never be stale in L2 due to
7317                                                              the memory probes.
7318      **Release Atomic**
7319      ------------------------------------------------------------------------------------
7320      store atomic release      - singlethread - global   1. buffer/global/flat_store
7321                                - wavefront    - generic
7322      store atomic release      - singlethread - local    *If TgSplit execution mode,
7323                                - wavefront               local address space cannot
7324                                                          be used.*
7325
7326                                                          1. ds_store
7327      store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7328                                               - generic
7329                                                            - Use lgkmcnt(0) if not
7330                                                              TgSplit execution mode
7331                                                              and vmcnt(0) if TgSplit
7332                                                              execution mode.
7333                                                            - If OpenCL, omit lgkmcnt(0).
7334                                                            - s_waitcnt vmcnt(0)
7335                                                              must happen after
7336                                                              any preceding
7337                                                              global/generic load/store/
7338                                                              load atomic/store atomic/
7339                                                              atomicrmw.
7340                                                            - s_waitcnt lgkmcnt(0)
7341                                                              must happen after
7342                                                              any preceding
7343                                                              local/generic
7344                                                              load/store/load
7345                                                              atomic/store
7346                                                              atomic/atomicrmw.
7347                                                            - Must happen before
7348                                                              the following
7349                                                              store.
7350                                                            - Ensures that all
7351                                                              memory operations
7352                                                              have
7353                                                              completed before
7354                                                              performing the
7355                                                              store that is being
7356                                                              released.
7357
7358                                                          2. buffer/global/flat_store
7359      store atomic release      - workgroup    - local    *If TgSplit execution mode,
7360                                                          local address space cannot
7361                                                          be used.*
7362
7363                                                          1. ds_store
7364      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7365                                               - generic     vmcnt(0)
7366
7367                                                            - If TgSplit execution mode,
7368                                                              omit lgkmcnt(0).
7369                                                            - If OpenCL and
7370                                                              address space is
7371                                                              not generic, omit
7372                                                              lgkmcnt(0).
7373                                                            - Could be split into
7374                                                              separate s_waitcnt
7375                                                              vmcnt(0) and
7376                                                              s_waitcnt
7377                                                              lgkmcnt(0) to allow
7378                                                              them to be
7379                                                              independently moved
7380                                                              according to the
7381                                                              following rules.
7382                                                            - s_waitcnt vmcnt(0)
7383                                                              must happen after
7384                                                              any preceding
7385                                                              global/generic
7386                                                              load/store/load
7387                                                              atomic/store
7388                                                              atomic/atomicrmw.
7389                                                            - s_waitcnt lgkmcnt(0)
7390                                                              must happen after
7391                                                              any preceding
7392                                                              local/generic
7393                                                              load/store/load
7394                                                              atomic/store
7395                                                              atomic/atomicrmw.
7396                                                            - Must happen before
7397                                                              the following
7398                                                              store.
7399                                                            - Ensures that all
7400                                                              memory operations
7401                                                              have
7402                                                              completed before
7403                                                              performing the
7404                                                              store that is being
7405                                                              released.
7406
7407                                                          2. buffer/global/flat_store
7408      store atomic release      - system       - global   1. buffer_wbl2
7409                                               - generic
7410                                                            - Must happen before
7411                                                              following s_waitcnt.
7412                                                            - Performs L2 writeback to
7413                                                              ensure previous
7414                                                              global/generic
7415                                                              store/atomicrmw are
7416                                                              visible at system scope.
7417
7418                                                          2. s_waitcnt lgkmcnt(0) &
7419                                                             vmcnt(0)
7420
7421                                                            - If TgSplit execution mode,
7422                                                              omit lgkmcnt(0).
7423                                                            - If OpenCL and
7424                                                              address space is
7425                                                              not generic, omit
7426                                                              lgkmcnt(0).
7427                                                            - Could be split into
7428                                                              separate s_waitcnt
7429                                                              vmcnt(0) and
7430                                                              s_waitcnt
7431                                                              lgkmcnt(0) to allow
7432                                                              them to be
7433                                                              independently moved
7434                                                              according to the
7435                                                              following rules.
7436                                                            - s_waitcnt vmcnt(0)
7437                                                              must happen after any
7438                                                              preceding
7439                                                              global/generic
7440                                                              load/store/load
7441                                                              atomic/store
7442                                                              atomic/atomicrmw.
7443                                                            - s_waitcnt lgkmcnt(0)
7444                                                              must happen after any
7445                                                              preceding
7446                                                              local/generic
7447                                                              load/store/load
7448                                                              atomic/store
7449                                                              atomic/atomicrmw.
7450                                                            - Must happen before
7451                                                              the following
7452                                                              store.
7453                                                            - Ensures that all
7454                                                              memory operations
7455                                                              to memory and the L2
7456                                                              writeback have
7457                                                              completed before
7458                                                              performing the
7459                                                              store that is being
7460                                                              released.
7461
7462                                                          3. buffer/global/flat_store
7463      atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7464                                - wavefront    - generic
7465      atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7466                                - wavefront               local address space cannot
7467                                                          be used.*
7468
7469                                                          1. ds_atomic
7470      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7471                                               - generic
7472                                                            - Use lgkmcnt(0) if not
7473                                                              TgSplit execution mode
7474                                                              and vmcnt(0) if TgSplit
7475                                                              execution mode.
7476                                                            - If OpenCL, omit
7477                                                              lgkmcnt(0).
7478                                                            - s_waitcnt vmcnt(0)
7479                                                              must happen after
7480                                                              any preceding
7481                                                              global/generic load/store/
7482                                                              load atomic/store atomic/
7483                                                              atomicrmw.
7484                                                            - s_waitcnt lgkmcnt(0)
7485                                                              must happen after
7486                                                              any preceding
7487                                                              local/generic
7488                                                              load/store/load
7489                                                              atomic/store
7490                                                              atomic/atomicrmw.
7491                                                            - Must happen before
7492                                                              the following
7493                                                              atomicrmw.
7494                                                            - Ensures that all
7495                                                              memory operations
7496                                                              have
7497                                                              completed before
7498                                                              performing the
7499                                                              atomicrmw that is
7500                                                              being released.
7501
7502                                                          2. buffer/global/flat_atomic
7503      atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7504                                                          local address space cannot
7505                                                          be used.*
7506
7507                                                          1. ds_atomic
7508      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7509                                               - generic     vmcnt(0)
7510
7511                                                            - If TgSplit execution mode,
7512                                                              omit lgkmcnt(0).
7513                                                            - If OpenCL, omit
7514                                                              lgkmcnt(0).
7515                                                            - Could be split into
7516                                                              separate s_waitcnt
7517                                                              vmcnt(0) and
7518                                                              s_waitcnt
7519                                                              lgkmcnt(0) to allow
7520                                                              them to be
7521                                                              independently moved
7522                                                              according to the
7523                                                              following rules.
7524                                                            - s_waitcnt vmcnt(0)
7525                                                              must happen after
7526                                                              any preceding
7527                                                              global/generic
7528                                                              load/store/load
7529                                                              atomic/store
7530                                                              atomic/atomicrmw.
7531                                                            - s_waitcnt lgkmcnt(0)
7532                                                              must happen after
7533                                                              any preceding
7534                                                              local/generic
7535                                                              load/store/load
7536                                                              atomic/store
7537                                                              atomic/atomicrmw.
7538                                                            - Must happen before
7539                                                              the following
7540                                                              atomicrmw.
7541                                                            - Ensures that all
7542                                                              memory operations
7543                                                              to global and local
7544                                                              have completed
7545                                                              before performing
7546                                                              the atomicrmw that
7547                                                              is being released.
7548
7549                                                          2. buffer/global/flat_atomic
7550      atomicrmw    release      - system       - global   1. buffer_wbl2
7551                                               - generic
7552                                                            - Must happen before
7553                                                              following s_waitcnt.
7554                                                            - Performs L2 writeback to
7555                                                              ensure previous
7556                                                              global/generic
7557                                                              store/atomicrmw are
7558                                                              visible at system scope.
7559
7560                                                          2. s_waitcnt lgkmcnt(0) &
7561                                                             vmcnt(0)
7562
7563                                                            - If TgSplit execution mode,
7564                                                              omit lgkmcnt(0).
7565                                                            - If OpenCL, omit
7566                                                              lgkmcnt(0).
7567                                                            - Could be split into
7568                                                              separate s_waitcnt
7569                                                              vmcnt(0) and
7570                                                              s_waitcnt
7571                                                              lgkmcnt(0) to allow
7572                                                              them to be
7573                                                              independently moved
7574                                                              according to the
7575                                                              following rules.
7576                                                            - s_waitcnt vmcnt(0)
7577                                                              must happen after
7578                                                              any preceding
7579                                                              global/generic
7580                                                              load/store/load
7581                                                              atomic/store
7582                                                              atomic/atomicrmw.
7583                                                            - s_waitcnt lgkmcnt(0)
7584                                                              must happen after
7585                                                              any preceding
7586                                                              local/generic
7587                                                              load/store/load
7588                                                              atomic/store
7589                                                              atomic/atomicrmw.
7590                                                            - Must happen before
7591                                                              the following
7592                                                              atomicrmw.
7593                                                            - Ensures that all
7594                                                              memory operations
7595                                                              to memory and the L2
7596                                                              writeback have
7597                                                              completed before
7598                                                              performing the
7599                                                              atomicrmw that is
7600                                                              being released.
7601
7602                                                          3. buffer/global/flat_atomic
7603      fence        release      - singlethread *none*     *none*
7604                                - wavefront
7605      fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7606
7607                                                            - Use lgkmcnt(0) if not
7608                                                              TgSplit execution mode
7609                                                              and vmcnt(0) if TgSplit
7610                                                              execution mode.
7611                                                            - If OpenCL and
7612                                                              address space is
7613                                                              not generic, omit
7614                                                              lgkmcnt(0).
7615                                                            - If OpenCL and
7616                                                              address space is
7617                                                              local, omit
7618                                                              vmcnt(0).
7619                                                            - However, since LLVM
7620                                                              currently has no
7621                                                              address space on
7622                                                              the fence, the waits
7623                                                              must conservatively
7624                                                              always be generated. If
7625                                                              fence had an
7626                                                              address space then
7627                                                              set to address
7628                                                              space of OpenCL
7629                                                              fence flag, or to
7630                                                              generic if both
7631                                                              local and global
7632                                                              flags are
7633                                                              specified.
7634                                                            - s_waitcnt vmcnt(0)
7635                                                              must happen after
7636                                                              any preceding
7637                                                              global/generic
7638                                                              load/store/
7639                                                              load atomic/store atomic/
7640                                                              atomicrmw.
7641                                                            - s_waitcnt lgkmcnt(0)
7642                                                              must happen after
7643                                                              any preceding
7644                                                              local/generic
7645                                                              load/load
7646                                                              atomic/store/store
7647                                                              atomic/atomicrmw.
7648                                                            - Must happen before
7649                                                              any following store
7650                                                              atomic/atomicrmw
7651                                                              with an equal or
7652                                                              wider sync scope
7653                                                              and memory ordering
7654                                                              stronger than
7655                                                              unordered (this is
7656                                                              termed the
7657                                                              fence-paired-atomic).
7658                                                            - Ensures that all
7659                                                              memory operations
7660                                                              have
7661                                                              completed before
7662                                                              performing the
7663                                                              following
7664                                                              fence-paired-atomic.
7665
7666      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7667                                                             vmcnt(0)
7668
7669                                                            - If TgSplit execution mode,
7670                                                              omit lgkmcnt(0).
7671                                                            - If OpenCL and
7672                                                              address space is
7673                                                              not generic, omit
7674                                                              lgkmcnt(0).
7675                                                            - If OpenCL and
7676                                                              address space is
7677                                                              local, omit
7678                                                              vmcnt(0).
7679                                                            - However, since LLVM
7680                                                              currently has no
7681                                                              address space on
7682                                                              the fence, the waits
7683                                                              must conservatively
7684                                                              always be generated. If
7685                                                              fence had an
7686                                                              address space then
7687                                                              set to address
7688                                                              space of OpenCL
7689                                                              fence flag, or to
7690                                                              generic if both
7691                                                              local and global
7692                                                              flags are
7693                                                              specified.
7694                                                            - Could be split into
7695                                                              separate s_waitcnt
7696                                                              vmcnt(0) and
7697                                                              s_waitcnt
7698                                                              lgkmcnt(0) to allow
7699                                                              them to be
7700                                                              independently moved
7701                                                              according to the
7702                                                              following rules.
7703                                                            - s_waitcnt vmcnt(0)
7704                                                              must happen after
7705                                                              any preceding
7706                                                              global/generic
7707                                                              load/store/load
7708                                                              atomic/store
7709                                                              atomic/atomicrmw.
7710                                                            - s_waitcnt lgkmcnt(0)
7711                                                              must happen after
7712                                                              any preceding
7713                                                              local/generic
7714                                                              load/store/load
7715                                                              atomic/store
7716                                                              atomic/atomicrmw.
7717                                                            - Must happen before
7718                                                              any following store
7719                                                              atomic/atomicrmw
7720                                                              with an equal or
7721                                                              wider sync scope
7722                                                              and memory ordering
7723                                                              stronger than
7724                                                              unordered (this is
7725                                                              termed the
7726                                                              fence-paired-atomic).
7727                                                            - Ensures that all
7728                                                              memory operations
7729                                                              have
7730                                                              completed before
7731                                                              performing the
7732                                                              following
7733                                                              fence-paired-atomic.
7734
7735      fence        release      - system       *none*     1. buffer_wbl2
7736
7737                                                            - If OpenCL and
7738                                                              address space is
7739                                                              local, omit.
7740                                                            - Must happen before
7741                                                              following s_waitcnt.
7742                                                            - Performs L2 writeback to
7743                                                              ensure previous
7744                                                              global/generic
7745                                                              store/atomicrmw are
7746                                                              visible at system scope.
7747
7748                                                          2. s_waitcnt lgkmcnt(0) &
7749                                                             vmcnt(0)
7750
7751                                                            - If TgSplit execution mode,
7752                                                              omit lgkmcnt(0).
7753                                                            - If OpenCL and
7754                                                              address space is
7755                                                              not generic, omit
7756                                                              lgkmcnt(0).
7757                                                            - If OpenCL and
7758                                                              address space is
7759                                                              local, omit
7760                                                              vmcnt(0).
7761                                                            - However, since LLVM
7762                                                              currently has no
7763                                                              address space on
7764                                                              the fence, it must
7765                                                              conservatively
7766                                                              always be generated. If
7767                                                              the fence had an
7768                                                              address space, then
7769                                                              set to the address
7770                                                              space of the OpenCL
7771                                                              fence flag, or to
7772                                                              generic if both
7773                                                              local and global
7774                                                              flags are
7775                                                              specified.
7776                                                            - Could be split into
7777                                                              separate s_waitcnt
7778                                                              vmcnt(0) and
7779                                                              s_waitcnt
7780                                                              lgkmcnt(0) to allow
7781                                                              them to be
7782                                                              independently moved
7783                                                              according to the
7784                                                              following rules.
7785                                                            - s_waitcnt vmcnt(0)
7786                                                              must happen after
7787                                                              any preceding
7788                                                              global/generic
7789                                                              load/store/load
7790                                                              atomic/store
7791                                                              atomic/atomicrmw.
7792                                                            - s_waitcnt lgkmcnt(0)
7793                                                              must happen after
7794                                                              any preceding
7795                                                              local/generic
7796                                                              load/store/load
7797                                                              atomic/store
7798                                                              atomic/atomicrmw.
7799                                                            - Must happen before
7800                                                              any following store
7801                                                              atomic/atomicrmw
7802                                                              with an equal or
7803                                                              wider sync scope
7804                                                              and memory ordering
7805                                                              stronger than
7806                                                              unordered (this is
7807                                                              termed the
7808                                                              fence-paired-atomic).
7809                                                            - Ensures that all
7810                                                              memory operations
7811                                                              have
7812                                                              completed before
7813                                                              performing the
7814                                                              following
7815                                                              fence-paired-atomic.
7816
7817      **Acquire-Release Atomic**
7818      ------------------------------------------------------------------------------------
7819      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7820                                - wavefront    - generic
7821      atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7822                                - wavefront               local address space cannot
7823                                                          be used.*
7824
7825                                                          1. ds_atomic
7826      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7827
7828                                                            - Use lgkmcnt(0) if not
7829                                                              TgSplit execution mode
7830                                                              and vmcnt(0) if TgSplit
7831                                                              execution mode.
7832                                                            - If OpenCL, omit
7833                                                              lgkmcnt(0).
7834                                                            - Must happen after
7835                                                              any preceding
7836                                                              local/generic
7837                                                              load/store/load
7838                                                              atomic/store
7839                                                              atomic/atomicrmw.
7840                                                            - s_waitcnt vmcnt(0)
7841                                                              must happen after
7842                                                              any preceding
7843                                                              global/generic load/store/
7844                                                              load atomic/store atomic/
7845                                                              atomicrmw.
7846                                                            - s_waitcnt lgkmcnt(0)
7847                                                              must happen after
7848                                                              any preceding
7849                                                              local/generic
7850                                                              load/store/load
7851                                                              atomic/store
7852                                                              atomic/atomicrmw.
7853                                                            - Must happen before
7854                                                              the following
7855                                                              atomicrmw.
7856                                                            - Ensures that all
7857                                                              memory operations
7858                                                              have
7859                                                              completed before
7860                                                              performing the
7861                                                              atomicrmw that is
7862                                                              being released.
7863
7864                                                          2. buffer/global_atomic
7865                                                          3. s_waitcnt vmcnt(0)
7866
7867                                                            - If not TgSplit execution
7868                                                              mode, omit.
7869                                                            - Must happen before
7870                                                              the following
7871                                                              buffer_wbinvl1_vol.
7872                                                            - Ensures any
7873                                                              following global
7874                                                              data read is no
7875                                                              older than the
7876                                                              atomicrmw value
7877                                                              being acquired.
7878
7879                                                          4. buffer_wbinvl1_vol
7880
7881                                                            - If not TgSplit execution
7882                                                              mode, omit.
7883                                                            - Ensures that
7884                                                              following
7885                                                              loads will not see
7886                                                              stale data.
7887
7888      atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
7889                                                          local address space cannot
7890                                                          be used.*
7891
7892                                                          1. ds_atomic
7893                                                          2. s_waitcnt lgkmcnt(0)
7894
7895                                                            - If OpenCL, omit.
7896                                                            - Must happen before
7897                                                              any following
7898                                                              global/generic
7899                                                              load/load
7900                                                              atomic/store/store
7901                                                              atomic/atomicrmw.
7902                                                            - Ensures any
7903                                                              following global
7904                                                              data read is no
7905                                                              older than the local
7906                                                              atomicrmw value being
7907                                                              acquired.
7908
7909      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
7910
7911                                                            - Use lgkmcnt(0) if not
7912                                                              TgSplit execution mode
7913                                                              and vmcnt(0) if TgSplit
7914                                                              execution mode.
7915                                                            - If OpenCL, omit
7916                                                              lgkmcnt(0).
7917                                                            - s_waitcnt vmcnt(0)
7918                                                              must happen after
7919                                                              any preceding
7920                                                              global/generic load/store/
7921                                                              load atomic/store atomic/
7922                                                              atomicrmw.
7923                                                            - s_waitcnt lgkmcnt(0)
7924                                                              must happen after
7925                                                              any preceding
7926                                                              local/generic
7927                                                              load/store/load
7928                                                              atomic/store
7929                                                              atomic/atomicrmw.
7930                                                            - Must happen before
7931                                                              the following
7932                                                              atomicrmw.
7933                                                            - Ensures that all
7934                                                              memory operations
7935                                                              have
7936                                                              completed before
7937                                                              performing the
7938                                                              atomicrmw that is
7939                                                              being released.
7940
7941                                                          2. flat_atomic
7942                                                          3. s_waitcnt lgkmcnt(0) &
7943                                                             vmcnt(0)
7944
7945                                                            - If not TgSplit execution
7946                                                              mode, omit vmcnt(0).
7947                                                            - If OpenCL, omit
7948                                                              lgkmcnt(0).
7949                                                            - Must happen before
7950                                                              the following
7951                                                              buffer_wbinvl1_vol and
7952                                                              any following
7953                                                              global/generic
7954                                                              load/load
7955                                                              atomic/store/store
7956                                                              atomic/atomicrmw.
7957                                                            - Ensures any
7958                                                              following global
7959                                                              data read is no
7960                                                              older than a local
7961                                                              atomicrmw value being
7962                                                              acquired.
7963
7964                                                          4. buffer_wbinvl1_vol
7965
7966                                                            - If not TgSplit execution
7967                                                              mode, omit.
7968                                                            - Ensures that
7969                                                              following
7970                                                              loads will not see
7971                                                              stale data.
7972
7973      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7974                                                             vmcnt(0)
7975
7976                                                            - If TgSplit execution mode,
7977                                                              omit lgkmcnt(0).
7978                                                            - If OpenCL, omit
7979                                                              lgkmcnt(0).
7980                                                            - Could be split into
7981                                                              separate s_waitcnt
7982                                                              vmcnt(0) and
7983                                                              s_waitcnt
7984                                                              lgkmcnt(0) to allow
7985                                                              them to be
7986                                                              independently moved
7987                                                              according to the
7988                                                              following rules.
7989                                                            - s_waitcnt vmcnt(0)
7990                                                              must happen after
7991                                                              any preceding
7992                                                              global/generic
7993                                                              load/store/load
7994                                                              atomic/store
7995                                                              atomic/atomicrmw.
7996                                                            - s_waitcnt lgkmcnt(0)
7997                                                              must happen after
7998                                                              any preceding
7999                                                              local/generic
8000                                                              load/store/load
8001                                                              atomic/store
8002                                                              atomic/atomicrmw.
8003                                                            - Must happen before
8004                                                              the following
8005                                                              atomicrmw.
8006                                                            - Ensures that all
8007                                                              memory operations
8008                                                              to global have
8009                                                              completed before
8010                                                              performing the
8011                                                              atomicrmw that is
8012                                                              being released.
8013
8014                                                          2. buffer/global_atomic
8015                                                          3. s_waitcnt vmcnt(0)
8016
8017                                                            - Must happen before
8018                                                              following
8019                                                              buffer_wbinvl1_vol.
8020                                                            - Ensures the
8021                                                              atomicrmw has
8022                                                              completed before
8023                                                              invalidating the
8024                                                              cache.
8025
8026                                                          4. buffer_wbinvl1_vol
8027
8028                                                            - Must happen before
8029                                                              any following
8030                                                              global/generic
8031                                                              load/load
8032                                                              atomic/atomicrmw.
8033                                                            - Ensures that
8034                                                              following loads
8035                                                              will not see stale
8036                                                              global data.
8037
8038      atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
8039
8040                                                            - Must happen before
8041                                                              following s_waitcnt.
8042                                                            - Performs L2 writeback to
8043                                                              ensure previous
8044                                                              global/generic
8045                                                              store/atomicrmw are
8046                                                              visible at system scope.
8047
8048                                                          2. s_waitcnt lgkmcnt(0) &
8049                                                             vmcnt(0)
8050
8051                                                            - If TgSplit execution mode,
8052                                                              omit lgkmcnt(0).
8053                                                            - If OpenCL, omit
8054                                                              lgkmcnt(0).
8055                                                            - Could be split into
8056                                                              separate s_waitcnt
8057                                                              vmcnt(0) and
8058                                                              s_waitcnt
8059                                                              lgkmcnt(0) to allow
8060                                                              them to be
8061                                                              independently moved
8062                                                              according to the
8063                                                              following rules.
8064                                                            - s_waitcnt vmcnt(0)
8065                                                              must happen after
8066                                                              any preceding
8067                                                              global/generic
8068                                                              load/store/load
8069                                                              atomic/store
8070                                                              atomic/atomicrmw.
8071                                                            - s_waitcnt lgkmcnt(0)
8072                                                              must happen after
8073                                                              any preceding
8074                                                              local/generic
8075                                                              load/store/load
8076                                                              atomic/store
8077                                                              atomic/atomicrmw.
8078                                                            - Must happen before
8079                                                              the following
8080                                                              atomicrmw.
8081                                                            - Ensures that all
8082                                                              memory operations
8083                                                              to global and L2 writeback
8084                                                              have completed before
8085                                                              performing the
8086                                                              atomicrmw that is
8087                                                              being released.
8088
8089                                                          3. buffer/global_atomic
8090                                                          4. s_waitcnt vmcnt(0)
8091
8092                                                            - Must happen before
8093                                                              following buffer_invl2 and
8094                                                              buffer_wbinvl1_vol.
8095                                                            - Ensures the
8096                                                              atomicrmw has
8097                                                              completed before
8098                                                              invalidating the
8099                                                              caches.
8100
8101                                                          5. buffer_invl2;
8102                                                             buffer_wbinvl1_vol
8103
8104                                                            - Must happen before
8105                                                              any following
8106                                                              global/generic
8107                                                              load/load
8108                                                              atomic/atomicrmw.
8109                                                            - Ensures that
8110                                                              following
8111                                                              loads will not see
8112                                                              stale L1 global data,
8113                                                              nor see stale L2 MTYPE
8114                                                              NC global data.
8115                                                              MTYPE RW and CC memory will
8116                                                              never be stale in L2 due to
8117                                                              the memory probes.
8118
8119      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
8120                                                             vmcnt(0)
8121
8122                                                            - If TgSplit execution mode,
8123                                                              omit lgkmcnt(0).
8124                                                            - If OpenCL, omit
8125                                                              lgkmcnt(0).
8126                                                            - Could be split into
8127                                                              separate s_waitcnt
8128                                                              vmcnt(0) and
8129                                                              s_waitcnt
8130                                                              lgkmcnt(0) to allow
8131                                                              them to be
8132                                                              independently moved
8133                                                              according to the
8134                                                              following rules.
8135                                                            - s_waitcnt vmcnt(0)
8136                                                              must happen after
8137                                                              any preceding
8138                                                              global/generic
8139                                                              load/store/load
8140                                                              atomic/store
8141                                                              atomic/atomicrmw.
8142                                                            - s_waitcnt lgkmcnt(0)
8143                                                              must happen after
8144                                                              any preceding
8145                                                              local/generic
8146                                                              load/store/load
8147                                                              atomic/store
8148                                                              atomic/atomicrmw.
8149                                                            - Must happen before
8150                                                              the following
8151                                                              atomicrmw.
8152                                                            - Ensures that all
8153                                                              memory operations
8154                                                              to global have
8155                                                              completed before
8156                                                              performing the
8157                                                              atomicrmw that is
8158                                                              being released.
8159
8160                                                          2. flat_atomic
8161                                                          3. s_waitcnt lgkmcnt(0) &
8162                                                             vmcnt(0)
8163
8164                                                            - If TgSplit execution mode,
8165                                                              omit lgkmcnt(0).
8166                                                            - If OpenCL, omit
8167                                                              lgkmcnt(0).
8168                                                            - Must happen before
8169                                                              following
8170                                                              buffer_wbinvl1_vol.
8171                                                            - Ensures the
8172                                                              atomicrmw has
8173                                                              completed before
8174                                                              invalidating the
8175                                                              cache.
8176
8177                                                          4. buffer_wbinvl1_vol
8178
8179                                                            - Must happen before
8180                                                              any following
8181                                                              global/generic
8182                                                              load/load
8183                                                              atomic/atomicrmw.
8184                                                            - Ensures that
8185                                                              following loads
8186                                                              will not see stale
8187                                                              global data.
8188
8189      atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
8190
8191                                                            - Must happen before
8192                                                              following s_waitcnt.
8193                                                            - Performs L2 writeback to
8194                                                              ensure previous
8195                                                              global/generic
8196                                                              store/atomicrmw are
8197                                                              visible at system scope.
8198
8199                                                          2. s_waitcnt lgkmcnt(0) &
8200                                                             vmcnt(0)
8201
8202                                                            - If TgSplit execution mode,
8203                                                              omit lgkmcnt(0).
8204                                                            - If OpenCL, omit
8205                                                              lgkmcnt(0).
8206                                                            - Could be split into
8207                                                              separate s_waitcnt
8208                                                              vmcnt(0) and
8209                                                              s_waitcnt
8210                                                              lgkmcnt(0) to allow
8211                                                              them to be
8212                                                              independently moved
8213                                                              according to the
8214                                                              following rules.
8215                                                            - s_waitcnt vmcnt(0)
8216                                                              must happen after
8217                                                              any preceding
8218                                                              global/generic
8219                                                              load/store/load
8220                                                              atomic/store
8221                                                              atomic/atomicrmw.
8222                                                            - s_waitcnt lgkmcnt(0)
8223                                                              must happen after
8224                                                              any preceding
8225                                                              local/generic
8226                                                              load/store/load
8227                                                              atomic/store
8228                                                              atomic/atomicrmw.
8229                                                            - Must happen before
8230                                                              the following
8231                                                              atomicrmw.
8232                                                            - Ensures that all
8233                                                              memory operations
8234                                                              to global and L2 writeback
8235                                                              have completed before
8236                                                              performing the
8237                                                              atomicrmw that is
8238                                                              being released.
8239
8240                                                          3. flat_atomic
8241                                                          4. s_waitcnt vmcnt(0) &
8242                                                             lgkmcnt(0)
8243
8244                                                            - If TgSplit execution mode,
8245                                                              omit lgkmcnt(0).
8246                                                            - If OpenCL, omit
8247                                                              lgkmcnt(0).
8248                                                            - Must happen before
8249                                                              following buffer_invl2 and
8250                                                              buffer_wbinvl1_vol.
8251                                                            - Ensures the
8252                                                              atomicrmw has
8253                                                              completed before
8254                                                              invalidating the
8255                                                              caches.
8256
8257                                                          5. buffer_invl2;
8258                                                             buffer_wbinvl1_vol
8259
8260                                                            - Must happen before
8261                                                              any following
8262                                                              global/generic
8263                                                              load/load
8264                                                              atomic/atomicrmw.
8265                                                            - Ensures that
8266                                                              following
8267                                                              loads will not see
8268                                                              stale L1 global data,
8269                                                              nor see stale L2 MTYPE
8270                                                              NC global data.
8271                                                              MTYPE RW and CC memory will
8272                                                              never be stale in L2 due to
8273                                                              the memory probes.
8274
8275      fence        acq_rel      - singlethread *none*     *none*
8276                                - wavefront
8277      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
8278
8279                                                            - Use lgkmcnt(0) if not
8280                                                              TgSplit execution mode
8281                                                              and vmcnt(0) if TgSplit
8282                                                              execution mode.
8283                                                            - If OpenCL and
8284                                                              address space is
8285                                                              not generic, omit
8286                                                              lgkmcnt(0).
8287                                                            - If OpenCL and
8288                                                              address space is
8289                                                              local, omit
8290                                                              vmcnt(0).
8291                                                            - However,
8292                                                              since LLVM
8293                                                              currently has no
8294                                                              address space on
8295                                                              the fence, the wait
8296                                                              must conservatively
8297                                                              always be generated
8298                                                              (see comment for
8299                                                              previous fence).
8300                                                            - s_waitcnt vmcnt(0)
8301                                                              must happen after
8302                                                              any preceding
8303                                                              global/generic
8304                                                              load/store/
8305                                                              load atomic/store atomic/
8306                                                              atomicrmw.
8307                                                            - s_waitcnt lgkmcnt(0)
8308                                                              must happen after
8309                                                              any preceding
8310                                                              local/generic
8311                                                              load/load
8312                                                              atomic/store/store
8313                                                              atomic/atomicrmw.
8314                                                            - Must happen before
8315                                                              any following
8316                                                              global/generic
8317                                                              load/load
8318                                                              atomic/store/store
8319                                                              atomic/atomicrmw.
8320                                                            - Ensures that all
8321                                                              memory operations
8322                                                              have
8323                                                              completed before
8324                                                              performing any
8325                                                              following global
8326                                                              memory operations.
8327                                                            - Ensures that the
8328                                                              preceding
8329                                                              local/generic load
8330                                                              atomic/atomicrmw
8331                                                              with an equal or
8332                                                              wider sync scope
8333                                                              and memory ordering
8334                                                              stronger than
8335                                                              unordered (this is
8336                                                              termed the
8337                                                              acquire-fence-paired-atomic)
8338                                                              has completed
8339                                                              before following
8340                                                              global memory
8341                                                              operations. This
8342                                                              satisfies the
8343                                                              requirements of
8344                                                              acquire.
8345                                                            - Ensures that all
8346                                                              previous memory
8347                                                              operations have
8348                                                              completed before a
8349                                                              following
8350                                                              local/generic store
8351                                                              atomic/atomicrmw
8352                                                              with an equal or
8353                                                              wider sync scope
8354                                                              and memory ordering
8355                                                              stronger than
8356                                                              unordered (this is
8357                                                              termed the
8358                                                              release-fence-paired-atomic).
8359                                                              This satisfies the
8360                                                              requirements of
8361                                                              release.
8362                                                            - Must happen before
8363                                                              the following
8364                                                              buffer_wbinvl1_vol.
8365                                                            - Ensures that the
8366                                                              acquire-fence-paired-atomic
8367                                                              has completed
8368                                                              before invalidating
8369                                                              the
8370                                                              cache. Therefore
8371                                                              any following
8372                                                              locations read must
8373                                                              be no older than
8374                                                              the value read by
8375                                                              the
8376                                                              acquire-fence-paired-atomic.
8377
8378                                                          2. buffer_wbinvl1_vol
8379
8380                                                            - If not TgSplit execution
8381                                                              mode, omit.
8382                                                            - Ensures that
8383                                                              following
8384                                                              loads will not see
8385                                                              stale data.
8386
8387      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8388                                                             vmcnt(0)
8389
8390                                                            - If TgSplit execution mode,
8391                                                              omit lgkmcnt(0).
8392                                                            - If OpenCL and
8393                                                              address space is
8394                                                              not generic, omit
8395                                                              lgkmcnt(0).
8396                                                            - However, since LLVM
8397                                                              currently has no
8398                                                              address space on
8399                                                              the fence, the wait
8400                                                              must conservatively
8401                                                              always be generated
8402                                                              (see comment for
8403                                                              previous fence).
8404                                                            - Could be split into
8405                                                              separate s_waitcnt
8406                                                              vmcnt(0) and
8407                                                              s_waitcnt
8408                                                              lgkmcnt(0) to allow
8409                                                              them to be
8410                                                              independently moved
8411                                                              according to the
8412                                                              following rules.
8413                                                            - s_waitcnt vmcnt(0)
8414                                                              must happen after
8415                                                              any preceding
8416                                                              global/generic
8417                                                              load/store/load
8418                                                              atomic/store
8419                                                              atomic/atomicrmw.
8420                                                            - s_waitcnt lgkmcnt(0)
8421                                                              must happen after
8422                                                              any preceding
8423                                                              local/generic
8424                                                              load/store/load
8425                                                              atomic/store
8426                                                              atomic/atomicrmw.
8427                                                            - Must happen before
8428                                                              the following
8429                                                              buffer_wbinvl1_vol.
8430                                                            - Ensures that the
8431                                                              preceding
8432                                                              global/local/generic
8433                                                              load
8434                                                              atomic/atomicrmw
8435                                                              with an equal or
8436                                                              wider sync scope
8437                                                              and memory ordering
8438                                                              stronger than
8439                                                              unordered (this is
8440                                                              termed the
8441                                                              acquire-fence-paired-atomic)
8442                                                              has completed
8443                                                              before invalidating
8444                                                              the cache. This
8445                                                              satisfies the
8446                                                              requirements of
8447                                                              acquire.
8448                                                            - Ensures that all
8449                                                              previous memory
8450                                                              operations have
8451                                                              completed before a
8452                                                              following
8453                                                              global/local/generic
8454                                                              store
8455                                                              atomic/atomicrmw
8456                                                              with an equal or
8457                                                              wider sync scope
8458                                                              and memory ordering
8459                                                              stronger than
8460                                                              unordered (this is
8461                                                              termed the
8462                                                              release-fence-paired-atomic).
8463                                                              This satisfies the
8464                                                              requirements of
8465                                                              release.
8466
8467                                                          2. buffer_wbinvl1_vol
8468
8469                                                            - Must happen before
8470                                                              any following
8471                                                              global/generic
8472                                                              load/load
8473                                                              atomic/store/store
8474                                                              atomic/atomicrmw.
8475                                                            - Ensures that
8476                                                              following loads
8477                                                              will not see stale
8478                                                              global data. This
8479                                                              satisfies the
8480                                                              requirements of
8481                                                              acquire.
8482
8483      fence        acq_rel      - system       *none*     1. buffer_wbl2
8484
8485                                                            - If OpenCL and
8486                                                              address space is
8487                                                              local, omit.
8488                                                            - Must happen before
8489                                                              following s_waitcnt.
8490                                                            - Performs L2 writeback to
8491                                                              ensure previous
8492                                                              global/generic
8493                                                              store/atomicrmw are
8494                                                              visible at system scope.
8495
8496                                                          2. s_waitcnt lgkmcnt(0) &
8497                                                             vmcnt(0)
8498
8499                                                            - If TgSplit execution mode,
8500                                                              omit lgkmcnt(0).
8501                                                            - If OpenCL and
8502                                                              address space is
8503                                                              not generic, omit
8504                                                              lgkmcnt(0).
8505                                                            - However, since LLVM
8506                                                              currently has no
8507                                                              address space on
8508                                                              the fence, the wait
8509                                                              must conservatively
8510                                                              always be generated
8511                                                              (see comment for
8512                                                              previous fence).
8513                                                            - Could be split into
8514                                                              separate s_waitcnt
8515                                                              vmcnt(0) and
8516                                                              s_waitcnt
8517                                                              lgkmcnt(0) to allow
8518                                                              them to be
8519                                                              independently moved
8520                                                              according to the
8521                                                              following rules.
8522                                                            - s_waitcnt vmcnt(0)
8523                                                              must happen after
8524                                                              any preceding
8525                                                              global/generic
8526                                                              load/store/load
8527                                                              atomic/store
8528                                                              atomic/atomicrmw.
8529                                                            - s_waitcnt lgkmcnt(0)
8530                                                              must happen after
8531                                                              any preceding
8532                                                              local/generic
8533                                                              load/store/load
8534                                                              atomic/store
8535                                                              atomic/atomicrmw.
8536                                                            - Must happen before
8537                                                              the following buffer_invl2 and
8538                                                              buffer_wbinvl1_vol.
8539                                                            - Ensures that the
8540                                                              preceding
8541                                                              global/local/generic
8542                                                              load
8543                                                              atomic/atomicrmw
8544                                                              with an equal or
8545                                                              wider sync scope
8546                                                              and memory ordering
8547                                                              stronger than
8548                                                              unordered (this is
8549                                                              termed the
8550                                                              acquire-fence-paired-atomic)
8551                                                              has completed
8552                                                              before invalidating
8553                                                              the cache. This
8554                                                              satisfies the
8555                                                              requirements of
8556                                                              acquire.
8557                                                            - Ensures that all
8558                                                              previous memory
8559                                                              operations have
8560                                                              completed before a
8561                                                              following
8562                                                              global/local/generic
8563                                                              store
8564                                                              atomic/atomicrmw
8565                                                              with an equal or
8566                                                              wider sync scope
8567                                                              and memory ordering
8568                                                              stronger than
8569                                                              unordered (this is
8570                                                              termed the
8571                                                              release-fence-paired-atomic).
8572                                                              This satisfies the
8573                                                              requirements of
8574                                                              release.
8575
8576                                                          3. buffer_invl2;
8577                                                             buffer_wbinvl1_vol
8578
8579                                                            - Must happen before
8580                                                              any following
8581                                                              global/generic
8582                                                              load/load
8583                                                              atomic/store/store
8584                                                              atomic/atomicrmw.
8585                                                            - Ensures that
8586                                                              following
8587                                                              loads will not see
8588                                                              stale L1 global data,
8589                                                              nor see stale L2 MTYPE
8590                                                              NC global data.
8591                                                              MTYPE RW and CC memory will
8592                                                              never be stale in L2 due to
8593                                                              the memory probes.
8594
8595      **Sequential Consistent Atomic**
8596      ------------------------------------------------------------------------------------
8597      load atomic  seq_cst      - singlethread - global   *Same as corresponding
8598                                - wavefront    - local    load atomic acquire,
8599                                               - generic  except must generate
8600                                                          all instructions even
8601                                                          for OpenCL.*
8602      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8603                                               - generic
8604                                                            - Use lgkmcnt(0) if not
8605                                                              TgSplit execution mode
8606                                                              and vmcnt(0) if TgSplit
8607                                                              execution mode.
8608                                                            - s_waitcnt lgkmcnt(0) must
8609                                                              happen after
8610                                                              preceding
8611                                                              local/generic load
8612                                                              atomic/store
8613                                                              atomic/atomicrmw
8614                                                              with memory
8615                                                              ordering of seq_cst
8616                                                              and with equal or
8617                                                              wider sync scope.
8618                                                              (Note that seq_cst
8619                                                              fences have their
8620                                                              own s_waitcnt
8621                                                              lgkmcnt(0) and so do
8622                                                              not need to be
8623                                                              considered.)
8624                                                            - s_waitcnt vmcnt(0)
8625                                                              must happen after
8626                                                              preceding
8627                                                              global/generic load
8628                                                              atomic/store
8629                                                              atomic/atomicrmw
8630                                                              with memory
8631                                                              ordering of seq_cst
8632                                                              and with equal or
8633                                                              wider sync scope.
8634                                                              (Note that seq_cst
8635                                                              fences have their
8636                                                              own s_waitcnt
8637                                                              vmcnt(0) and so do
8638                                                              not need to be
8639                                                              considered.)
8640                                                            - Ensures any
8641                                                              preceding
8642                                                              sequentially
8643                                                              consistent global/local
8644                                                              memory instructions
8645                                                              have completed
8646                                                              before executing
8647                                                              this sequentially
8648                                                              consistent
8649                                                              instruction. This
8650                                                              prevents reordering
8651                                                              a seq_cst store
8652                                                              followed by a
8653                                                              seq_cst load. (Note
8654                                                              that seq_cst is
8655                                                              stronger than
8656                                                              acquire/release as
8657                                                              the reordering of
8658                                                              load acquire
8659                                                              followed by a store
8660                                                              release is
8661                                                              prevented by the
8662                                                              s_waitcnt of
8663                                                              the release, but
8664                                                              there is nothing
8665                                                              preventing a store
8666                                                              release followed by
8667                                                              load acquire from
8668                                                              completing out of
8669                                                              order. The s_waitcnt
8670                                                              could be placed after the
8671                                                              seq_cst store or before
8672                                                              the seq_cst load. We
8673                                                              choose the load to
8674                                                              make the s_waitcnt be
8675                                                              as late as possible
8676                                                              so that the store
8677                                                              may have already
8678                                                              completed.)
8679
8680                                                          2. *Following
8681                                                             instructions same as
8682                                                             corresponding load
8683                                                             atomic acquire,
8684                                                             except must generate
8685                                                             all instructions even
8686                                                             for OpenCL.*
8687      load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8688                                                          local address space cannot
8689                                                          be used.*
8690
8691                                                          *Same as corresponding
8692                                                          load atomic acquire,
8693                                                          except must generate
8694                                                          all instructions even
8695                                                          for OpenCL.*
8696
8697      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8698                                - system       - generic     vmcnt(0)
8699
8700                                                            - If TgSplit execution mode,
8701                                                              omit lgkmcnt(0).
8702                                                            - Could be split into
8703                                                              separate s_waitcnt
8704                                                              vmcnt(0)
8705                                                              and s_waitcnt
8706                                                              lgkmcnt(0) to allow
8707                                                              them to be
8708                                                              independently moved
8709                                                              according to the
8710                                                              following rules.
8711                                                            - s_waitcnt lgkmcnt(0)
8712                                                              must happen after
8713                                                              preceding
8714                                                              local/generic load
8715                                                              atomic/store
8716                                                              atomic/atomicrmw
8717                                                              with memory
8718                                                              ordering of seq_cst
8719                                                              and with equal or
8720                                                              wider sync scope.
8721                                                              (Note that seq_cst
8722                                                              fences have their
8723                                                              own s_waitcnt
8724                                                              lgkmcnt(0) and so do
8725                                                              not need to be
8726                                                              considered.)
8727                                                            - s_waitcnt vmcnt(0)
8728                                                              must happen after
8729                                                              preceding
8730                                                              global/generic load
8731                                                              atomic/store
8732                                                              atomic/atomicrmw
8733                                                              with memory
8734                                                              ordering of seq_cst
8735                                                              and with equal or
8736                                                              wider sync scope.
8737                                                              (Note that seq_cst
8738                                                              fences have their
8739                                                              own s_waitcnt
8740                                                              vmcnt(0) and so do
8741                                                              not need to be
8742                                                              considered.)
8743                                                            - Ensures any
8744                                                              preceding
8745                                                              sequentially
8746                                                              consistent global
8747                                                              memory instructions
8748                                                              have completed
8749                                                              before executing
8750                                                              this sequentially
8751                                                              consistent
8752                                                              instruction. This
8753                                                              prevents reordering
8754                                                              a seq_cst store
8755                                                              followed by a
8756                                                              seq_cst load. (Note
8757                                                              that seq_cst is
8758                                                              stronger than
8759                                                              acquire/release as
8760                                                              the reordering of
8761                                                              load acquire
8762                                                              followed by a store
8763                                                              release is
8764                                                              prevented by the
8765                                                              s_waitcnt of
8766                                                              the release, but
8767                                                              there is nothing
8768                                                              preventing a store
8769                                                              release followed by
8770                                                              load acquire from
8771                                                              completing out of
8772                                                              order. The s_waitcnt
8773                                                              could be placed after the
8774                                                              seq_cst store or before
8775                                                              the seq_cst load. We
8776                                                              choose the load to
8777                                                              make the s_waitcnt be
8778                                                              as late as possible
8779                                                              so that the store
8780                                                              may have already
8781                                                              completed.)
8782
8783                                                          2. *Following
8784                                                             instructions same as
8785                                                             corresponding load
8786                                                             atomic acquire,
8787                                                             except must generate
8788                                                             all instructions even
8789                                                             for OpenCL.*
8790      store atomic seq_cst      - singlethread - global   *Same as corresponding
8791                                - wavefront    - local    store atomic release,
8792                                - workgroup    - generic  except must generate
8793                                - agent                   all instructions even
8794                                - system                  for OpenCL.*
8795      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8796                                - wavefront    - local    atomicrmw acq_rel,
8797                                - workgroup    - generic  except must generate
8798                                - agent                   all instructions even
8799                                - system                  for OpenCL.*
8800      fence        seq_cst      - singlethread *none*     *Same as corresponding
8801                                - wavefront               fence acq_rel,
8802                                - workgroup               except must generate
8803                                - agent                   all instructions even
8804                                - system                  for OpenCL.*
8805      ============ ============ ============== ========== ================================
8806
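     The note in the sequentially consistent rows above, about a seq_cst store
     followed by a seq_cst load, can be illustrated with the classic
     store-buffering pattern. The following LLVM IR is a minimal sketch: the
     function and value names are hypothetical, and the comments only restate
     where the table places the ``s_waitcnt``. The instructions actually emitted
     depend on the target and the compiler version.

     .. code-block:: llvm

       ; Each function models one thread. With seq_cst ordering, the outcome
       ; %r0 == 0 and %r1 == 0 is not allowed. With only release stores and
       ; acquire loads it would be allowed.
       define i32 @thread0(ptr addrspace(1) %x, ptr addrspace(1) %y) {
       entry:
         store atomic i32 1, ptr addrspace(1) %x syncscope("agent") seq_cst, align 4
         ; Per the table, the s_waitcnt lgkmcnt(0) & vmcnt(0) is placed before
         ; the seq_cst load rather than after the seq_cst store.
         %r0 = load atomic i32, ptr addrspace(1) %y syncscope("agent") seq_cst, align 4
         ret i32 %r0
       }

       define i32 @thread1(ptr addrspace(1) %x, ptr addrspace(1) %y) {
       entry:
         store atomic i32 1, ptr addrspace(1) %y syncscope("agent") seq_cst, align 4
         %r1 = load atomic i32, ptr addrspace(1) %x syncscope("agent") seq_cst, align 4
         ret i32 %r1
       }
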
8807 .. _amdgpu-amdhsa-memory-model-gfx940:
8808
8809 Memory Model GFX940
8810 +++++++++++++++++++
8811
8812 For GFX940:
8813
8814 * Each agent has multiple shader arrays (SA).
8815 * Each SA has multiple compute units (CU).
8816 * Each CU has multiple SIMDs that execute wavefronts.
8817 * The wavefronts for a single work-group are executed in the same CU but may be
8818   executed by different SIMDs. The exception is tgsplit execution mode, in
8819   which the wavefronts may be executed by different SIMDs in different CUs.
8820 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
8821   executing on it. The exception is tgsplit execution mode, in which no LDS is
8822   allocated because wavefronts of the same work-group can be in different CUs.
8823 * All LDS operations of a CU are performed as wavefront-wide operations in a
8824   global order and involve no caching. Completion is reported to a wavefront in
8825   execution order.
8826 * The LDS memory has multiple request queues shared by the SIMDs of a
8827   CU. Therefore, the LDS operations performed by different wavefronts of a
8828   work-group can be reordered relative to each other, which can result in
8829   reordering the visibility of vector memory operations with respect to LDS
8830   operations of other wavefronts in the same work-group. A ``s_waitcnt
8831   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8832   vector memory operations between wavefronts of a work-group, but not between
8833   operations performed by the same wavefront.
8834 * The vector memory operations are performed as wavefront-wide operations and
8835   completion is reported to a wavefront in execution order. The exception is
8836   that ``flat_load/store/atomic`` instructions can report out of vector memory
8837   order if they access LDS memory, and out of LDS operation order if they access
8838   global memory.
8839 * The vector memory operations access a single vector L1 cache shared by all
8840   SIMDs of a CU. Therefore:
8841
8842   * No special action is required for coherence between the lanes of a single
8843     wavefront.
8844
8845   * No special action is required for coherence between wavefronts in the same
8846     work-group since they execute on the same CU. The exception is tgsplit
8847     execution mode, in which wavefronts of the same work-group can be in
8848     different CUs, so a ``buffer_inv sc0`` is required to invalidate the L1
8849     cache.
8850
8851   * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
8852     between wavefronts executing in different work-groups as they may be
8853     executing on different CUs.
8854
8855   * Atomic read-modify-write instructions implicitly bypass the L1 cache.
8856     Therefore, they do not use the sc0 bit for coherence and instead use it to
8857     indicate if the instruction returns the original value being updated. They
8858     do use sc1 to indicate system or agent scope coherence.
8859
8860 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
8861   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
8862   scalar operations are used in a restricted way so do not impact the memory
8863   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
8864 * The vector and scalar memory operations use an L2 cache.
8865
8866   * GFX940 can be configured as a number of smaller agents, each having a
8867     single L2 shared by all CUs on the same agent, or as fewer (possibly
8868     one) larger agents, each with groups of CUs that share separate L2
8869     caches.
8870   * The L2 cache has independent channels to service disjoint ranges of virtual
8871     addresses.
8872   * Each CU has a separate request queue per channel for its associated L2.
8873     Therefore, the vector and scalar memory operations performed by wavefronts
8874     executing with different L1 caches and the same L2 cache can be reordered
8875     relative to each other.
8876   * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
8877     vector memory operations of different CUs. It ensures a previous vector
8878     memory operation has completed before executing a subsequent vector memory
8879     or LDS operation and so can be used to meet the requirements of acquire and
8880     release.
8881   * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
8882     (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
8883     the PTE C-bit set for memory not local to the L2.
8884
8885     * Any local memory cache lines will be automatically invalidated by writes
8886       from CUs associated with other L2 caches, or writes from the CPU, due to
8887       the cache probe caused by the PTE C-bit.
8888     * XGMI accesses from the CPU to local memory may be cached on the CPU.
8889       Subsequent access from the GPU will automatically invalidate or writeback
8890       the CPU cache due to the L2 probe filter.
8891     * To ensure coherence of local memory writes of CUs with different L1 caches
8892       in the same agent a ``buffer_wbl2`` is required. It does nothing if the
8893       agent is configured to have a single L2, or will writeback dirty L2 cache
8894       lines if configured to have multiple L2 caches.
8895     * To ensure coherence of local memory writes of CUs in different agents a
8896       ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
8897     * To ensure coherence of local memory reads of CUs with different L1 caches
8898       in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
8899       agent is configured to have a single L2, or will invalidate non-local L2
8900       cache lines if configured to have multiple L2 caches.
8901     * To ensure coherence of local memory reads of CUs in different agents a
8902       ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
8903       lines if configured to have multiple L2 caches.
8904
8905   * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
8906     UC (uncached) which bypasses the L2.
8907
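     The coherence rules listed above can be related to LLVM IR with a small
     message-passing sketch. This is an illustration only: the function names are
     hypothetical, and the comments restate the instructions named in the list
     above rather than the exact code sequences, which are given in the GFX940
     code sequence table.

     .. code-block:: llvm

       ; Producer: write the payload, then publish it with an agent-scope release.
       define void @producer(ptr addrspace(1) %data, ptr addrspace(1) %flag) {
       entry:
         store i32 42, ptr addrspace(1) %data, align 4
         ; Earlier writes must be visible to other CUs of the agent before the
         ; flag is. The list above names s_waitcnt vmcnt(0) and buffer_wbl2 as
         ; the relevant instructions for writes within one agent.
         store atomic i32 1, ptr addrspace(1) %flag syncscope("agent") release, align 4
         ret void
       }

       ; Consumer: spin on the flag with an agent-scope acquire, then read.
       define i32 @consumer(ptr addrspace(1) %data, ptr addrspace(1) %flag) {
       entry:
         br label %spin

       spin:
         %f = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
         %ready = icmp eq i32 %f, 1
         br i1 %ready, label %done, label %spin

       done:
         ; Later reads must not see stale cached data. The list above names
         ; s_waitcnt vmcnt(0) and buffer_inv sc1 as the relevant instructions
         ; for reads within one agent.
         %v = load i32, ptr addrspace(1) %data, align 4
         ret i32 %v
       }

     Using ``syncscope("agent")`` rather than system scope is a choice made for
     this sketch. A wider scope would bring in the cross-agent rules from the
     list instead.
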
8908 Scalar memory operations are only used to access memory that is proven to not
8909 change during the execution of the kernel dispatch. This includes constant
8910 address space and global address space for program scope ``const`` variables.
8911 Therefore, the kernel machine code does not have to maintain the scalar cache to
8912 ensure it is coherent with the vector caches. The scalar and vector caches are
8913 invalidated between kernel dispatches by CP since constant address space data
8914 may change between kernel dispatch executions. See
8915 :ref:`amdgpu-amdhsa-memory-spaces`.
8916
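     As a hedged illustration of memory that is a candidate for scalar access,
     the kernel below reads through a constant address space (address space 4)
     pointer passed as a kernel argument. The kernel and argument names are
     hypothetical, and whether a scalar load is actually selected is a decision
     of the backend, not something the IR guarantees.

     .. code-block:: llvm

       define amdgpu_kernel void @read_table(ptr addrspace(4) %table,
                                             ptr addrspace(1) %out) {
       entry:
         ; The address is the same for every lane and the memory is in the
         ; constant address space, so it is not expected to change during the
         ; kernel dispatch.
         %v = load i32, ptr addrspace(4) %table, align 4
         ; The result is written to global memory (address space 1) with a
         ; vector memory operation.
         store i32 %v, ptr addrspace(1) %out, align 4
         ret void
       }
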
8917 The one exception is if scalar writes are used to spill SGPR registers. In this
8918 case the AMDGPU backend ensures the memory location used to spill is never
8919 accessed by vector memory operations at the same time. If scalar writes are used
8920 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8921 return since the locations may be used for vector memory instructions by a
8922 future wavefront that uses the same scratch area, or a function call that
8923 creates a frame at the same address, respectively. There is no need for a
8924 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8925
8926 For kernarg backing memory:
8927
8928 * CP invalidates the L1 cache at the start of each kernel dispatch.
8929 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
8930   memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
8931   cache. This also causes it to be treated as non-volatile, so it is not
8932   invalidated by ``*_vol``.
8933 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8934   so the L2 cache will be coherent with the CPU and other agents.
8935
8936 Scratch backing memory (which is used for the private address space) is accessed
8937 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
8938 only accessed by a single thread, and is always write-before-read, there is
8939 never a need to invalidate these entries from the L1 cache. Hence all cache
8940 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
8941
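     The write-before-read property of the private address space can be seen in
     a small sketch. The function name is hypothetical, and an optimizing build
     may promote the stack slot to a register so that no scratch access remains.

     .. code-block:: llvm

       define i32 @use_scratch(i32 %a) {
       entry:
         ; Private (scratch) memory is address space 5 and belongs to one thread.
         %slot = alloca i32, align 4, addrspace(5)
         ; The slot is always written before it is read, so a stale cache line
         ; from an earlier user of the same scratch area is never observed.
         store i32 %a, ptr addrspace(5) %slot, align 4
         %v = load i32, ptr addrspace(5) %slot, align 4
         ret i32 %v
       }
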
8942 The code sequences used to implement the memory model for GFX940 are defined
8943 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
8944
8945   .. table:: AMDHSA Memory Model Code Sequences GFX940
8946      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
8947
8948      ============ ============ ============== ========== ================================
8949      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
8950                   Ordering     Sync Scope     Address    GFX940
8951                                               Space
8952      ============ ============ ============== ========== ================================
8953      **Non-Atomic**
8954      ------------------------------------------------------------------------------------
8955      load         *none*       *none*         - global   - !volatile & !nontemporal
8956                                               - generic
8957                                               - private    1. buffer/global/flat_load
8958                                               - constant
8959                                                          - !volatile & nontemporal
8960
8961                                                            1. buffer/global/flat_load
8962                                                               nt=1
8963
8964                                                          - volatile
8965
8966                                                            1. buffer/global/flat_load
8967                                                               sc0=1 sc1=1
8968                                                            2. s_waitcnt vmcnt(0)
8969
8970                                                             - Must happen before
8971                                                               any following volatile
8972                                                               global/generic
8973                                                               load/store.
8974                                                             - Ensures that
8975                                                               volatile
8976                                                               operations to
8977                                                               different
8978                                                               addresses will not
8979                                                               be reordered by
8980                                                               hardware.
8981
8982      load         *none*       *none*         - local    1. ds_load
8983      store        *none*       *none*         - global   - !volatile & !nontemporal
8984                                               - generic
8985                                               - private    1. buffer/global/flat_store
8986                                               - constant
8987                                                          - !volatile & nontemporal
8988
8989                                                            1. buffer/global/flat_store
8990                                                               nt=1
8991
8992                                                          - volatile
8993
8994                                                            1. buffer/global/flat_store
8995                                                               sc0=1 sc1=1
8996                                                            2. s_waitcnt vmcnt(0)
8997
8998                                                             - Must happen before
8999                                                               any following volatile
9000                                                               global/generic
9001                                                               load/store.
9002                                                             - Ensures that
9003                                                               volatile
9004                                                               operations to
9005                                                               different
9006                                                               addresses will not
9007                                                               be reordered by
9008                                                               hardware.
9009
9010      store        *none*       *none*         - local    1. ds_store
9011      **Unordered Atomic**
9012      ------------------------------------------------------------------------------------
9013      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
9014      store atomic unordered    *any*          *any*      *Same as non-atomic*.
9015      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
9016      **Monotonic Atomic**
9017      ------------------------------------------------------------------------------------
9018      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
9019                                - wavefront    - generic
9020      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
9021                                               - generic     sc0=1
9022      load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
9023                                - wavefront               local address space cannot
9024                                - workgroup               be used.*
9025
9026                                                          1. ds_load
9027      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
9028                                               - generic     sc1=1
9029      load atomic  monotonic    - system       - global   1. buffer/global/flat_load
9030                                               - generic     sc0=1 sc1=1
9031      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
9032                                - wavefront    - generic
9033      store atomic monotonic    - workgroup    - global   1. buffer/global/flat_store
9034                                               - generic     sc0=1
9035      store atomic monotonic    - agent        - global   1. buffer/global/flat_store
9036                                               - generic     sc1=1
9037      store atomic monotonic    - system       - global   1. buffer/global/flat_store
9038                                               - generic     sc0=1 sc1=1
9039      store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
9040                                - wavefront               local address space cannot
9041                                - workgroup               be used.*
9042
9043                                                          1. ds_store
9044      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
9045                                - wavefront    - generic
9046                                - workgroup
9047                                - agent
9048      atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
9049                                               - generic     sc1=1
9050      atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
9051                                - wavefront               local address space cannot
9052                                - workgroup               be used.*
9053
9054                                                          1. ds_atomic
9055      **Acquire Atomic**
9056      ------------------------------------------------------------------------------------
9057      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
9058                                - wavefront    - local
9059                                               - generic
9060      load atomic  acquire      - workgroup    - global   1. buffer/global_load sc0=1
9061                                                          2. s_waitcnt vmcnt(0)
9062
9063                                                            - If not TgSplit execution
9064                                                              mode, omit.
9065                                                            - Must happen before the
9066                                                              following buffer_inv.
9067
9068                                                          3. buffer_inv sc0=1
9069
9070                                                            - If not TgSplit execution
9071                                                              mode, omit.
9072                                                            - Must happen before
9073                                                              any following
9074                                                              global/generic
9075                                                              load/load
9076                                                              atomic/store/store
9077                                                              atomic/atomicrmw.
9078                                                            - Ensures that
9079                                                              following
9080                                                              loads will not see
9081                                                              stale data.
9082
9083      load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
9084                                                          local address space cannot
9085                                                          be used.*
9086
9087                                                          1. ds_load
9088                                                          2. s_waitcnt lgkmcnt(0)
9089
9090                                                            - If OpenCL, omit.
9091                                                            - Must happen before
9092                                                              any following
9093                                                              global/generic
9094                                                              load/load
9095                                                              atomic/store/store
9096                                                              atomic/atomicrmw.
9097                                                            - Ensures any
9098                                                              following global
9099                                                              data read is no
9100                                                              older than the local load
9101                                                              atomic value being
9102                                                              acquired.
9103
9104      load atomic  acquire      - workgroup    - generic  1. flat_load sc0=1
9105                                                          2. s_waitcnt lgkm/vmcnt(0)
9106
9107                                                            - Use lgkmcnt(0) if not
9108                                                              TgSplit execution mode
9109                                                              and vmcnt(0) if TgSplit
9110                                                              execution mode.
9111                                                            - If OpenCL, omit lgkmcnt(0).
9112                                                            - Must happen before
9113                                                              the following
9114                                                              buffer_inv and any
9115                                                              following global/generic
9116                                                              load/load
9117                                                              atomic/store/store
9118                                                              atomic/atomicrmw.
9119                                                            - Ensures any
9120                                                              following global
9121                                                              data read is no
9122                                                              older than a local load
9123                                                              atomic value being
9124                                                              acquired.
9125
9126                                                          3. buffer_inv sc0=1
9127
9128                                                            - If not TgSplit execution
9129                                                              mode, omit.
9130                                                            - Ensures that
9131                                                              following
9132                                                              loads will not see
9133                                                              stale data.
9134
9135      load atomic  acquire      - agent        - global   1. buffer/global_load
9136                                                             sc1=1
9137                                                          2. s_waitcnt vmcnt(0)
9138
9139                                                            - Must happen before
9140                                                              following
9141                                                              buffer_inv.
9142                                                            - Ensures the load
9143                                                              has completed
9144                                                              before invalidating
9145                                                              the cache.
9146
9147                                                          3. buffer_inv sc1=1
9148
9149                                                            - Must happen before
9150                                                              any following
9151                                                              global/generic
9152                                                              load/load
9153                                                              atomic/atomicrmw.
9154                                                            - Ensures that
9155                                                              following
9156                                                              loads will not see
9157                                                              stale global data.
9158
9159      load atomic  acquire      - system       - global   1. buffer/global/flat_load
9160                                                             sc0=1 sc1=1
9161                                                          2. s_waitcnt vmcnt(0)
9162
9163                                                            - Must happen before
9164                                                              following
9165                                                              buffer_inv.
9166                                                            - Ensures the load
9167                                                              has completed
9168                                                              before invalidating
9169                                                              the cache.
9170
9171                                                          3. buffer_inv sc0=1 sc1=1
9172
9173                                                            - Must happen before
9174                                                              any following
9175                                                              global/generic
9176                                                              load/load
9177                                                              atomic/atomicrmw.
9178                                                            - Ensures that
9179                                                              following
9180                                                              loads will not see
9181                                                              stale MTYPE NC global data.
9182                                                              MTYPE RW and CC memory will
9183                                                              never be stale due to the
9184                                                              memory probes.
9185
9186      load atomic  acquire      - agent        - generic  1. flat_load sc1=1
9187                                                          2. s_waitcnt vmcnt(0) &
9188                                                             lgkmcnt(0)
9189
9190                                                            - If TgSplit execution mode,
9191                                                              omit lgkmcnt(0).
9192                                                            - If OpenCL, omit
9193                                                              lgkmcnt(0).
9194                                                            - Must happen before
9195                                                              following
9196                                                              buffer_inv.
9197                                                            - Ensures the flat_load
9198                                                              has completed
9199                                                              before invalidating
9200                                                              the cache.
9201
9202                                                          3. buffer_inv sc1=1
9203
9204                                                            - Must happen before
9205                                                              any following
9206                                                              global/generic
9207                                                              load/load
9208                                                              atomic/atomicrmw.
9209                                                            - Ensures that
9210                                                              following loads
9211                                                              will not see stale
9212                                                              global data.
9213
9214      load atomic  acquire      - system       - generic  1. flat_load sc0=1 sc1=1
9215                                                          2. s_waitcnt vmcnt(0) &
9216                                                             lgkmcnt(0)
9217
9218                                                            - If TgSplit execution mode,
9219                                                              omit lgkmcnt(0).
9220                                                            - If OpenCL, omit
9221                                                              lgkmcnt(0).
9222                                                            - Must happen before
9223                                                              the following
9224                                                              buffer_inv.
9225                                                            - Ensures the flat_load
9226                                                              has completed
9227                                                              before invalidating
9228                                                              the caches.
9229
9230                                                          3. buffer_inv sc0=1 sc1=1
9231
9232                                                            - Must happen before
9233                                                              any following
9234                                                              global/generic
9235                                                              load/load
9236                                                              atomic/atomicrmw.
9237                                                            - Ensures that
9238                                                              following
9239                                                              loads will not see
9240                                                              stale MTYPE NC global data.
9241                                                              MTYPE RW and CC memory will
9242                                                              never be stale due to the
9243                                                              memory probes.
9244
9245      atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
9246                                - wavefront    - generic
9247      atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
9248                                - wavefront               local address space cannot
9249                                                          be used.*
9250
9251                                                          1. ds_atomic
9252      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
9253                                                          2. s_waitcnt vmcnt(0)
9254
9255                                                            - If not TgSplit execution
9256                                                              mode, omit.
9257                                                            - Must happen before the
9258                                                              following buffer_inv.
9259                                                            - Ensures the atomicrmw
9260                                                              has completed
9261                                                              before invalidating
9262                                                              the cache.
9263
9264                                                          3. buffer_inv sc0=1
9265
9266                                                            - If not TgSplit execution
9267                                                              mode, omit.
9268                                                            - Must happen before
9269                                                              any following
9270                                                              global/generic
9271                                                              load/load
9272                                                              atomic/atomicrmw.
9273                                                            - Ensures that
9274                                                              following loads
9275                                                              will not see stale
9276                                                              global data.
9277
9278      atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
9279                                                          local address space cannot
9280                                                          be used.*
9281
9282                                                          1. ds_atomic
9283                                                          2. s_waitcnt lgkmcnt(0)
9284
9285                                                            - If OpenCL, omit.
9286                                                            - Must happen before
9287                                                              any following
9288                                                              global/generic
9289                                                              load/load
9290                                                              atomic/store/store
9291                                                              atomic/atomicrmw.
9292                                                            - Ensures any
9293                                                              following global
9294                                                              data read is no
9295                                                              older than the local
9296                                                              atomicrmw value
9297                                                              being acquired.
9298
9299      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
9300                                                          2. s_waitcnt lgkm/vmcnt(0)
9301
9302                                                            - Use lgkmcnt(0) if not
9303                                                              TgSplit execution mode
9304                                                              and vmcnt(0) if TgSplit
9305                                                              execution mode.
9306                                                            - If OpenCL, omit lgkmcnt(0).
9307                                                            - Must happen before
9308                                                              the following
9309                                                              buffer_inv and
9310                                                              any following
9311                                                              global/generic
9312                                                              load/load
9313                                                              atomic/store/store
9314                                                              atomic/atomicrmw.
9315                                                            - Ensures any
9316                                                              following global
9317                                                              data read is no
9318                                                              older than a local
9319                                                              atomicrmw value
9320                                                              being acquired.
9321
9322                                                          3. buffer_inv sc0=1
9323
9324                                                            - If not TgSplit execution
9325                                                              mode, omit.
9326                                                            - Ensures that
9327                                                              following
9328                                                              loads will not see
9329                                                              stale data.
9330
9331      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
9332                                                          2. s_waitcnt vmcnt(0)
9333
9334                                                            - Must happen before
9335                                                              following
9336                                                              buffer_inv.
9337                                                            - Ensures the
9338                                                              atomicrmw has
9339                                                              completed before
9340                                                              invalidating the
9341                                                              cache.
9342
9343                                                          3. buffer_inv sc1=1
9344
9345                                                            - Must happen before
9346                                                              any following
9347                                                              global/generic
9348                                                              load/load
9349                                                              atomic/atomicrmw.
9350                                                            - Ensures that
9351                                                              following loads
9352                                                              will not see stale
9353                                                              global data.
9354
9355      atomicrmw    acquire      - system       - global   1. buffer/global_atomic
9356                                                             sc1=1
9357                                                          2. s_waitcnt vmcnt(0)
9358
9359                                                            - Must happen before
9360                                                              following
9361                                                              buffer_inv.
9362                                                            - Ensures the
9363                                                              atomicrmw has
9364                                                              completed before
9365                                                              invalidating the
9366                                                              caches.
9367
9368                                                          3. buffer_inv sc0=1 sc1=1
9369
9370                                                            - Must happen before
9371                                                              any following
9372                                                              global/generic
9373                                                              load/load
9374                                                              atomic/atomicrmw.
9375                                                            - Ensures that
9376                                                              following
9377                                                              loads will not see
9378                                                              stale MTYPE NC global data.
9379                                                              MTYPE RW and CC memory will
9380                                                              never be stale due to the
9381                                                              memory probes.
9382
9383      atomicrmw    acquire      - agent        - generic  1. flat_atomic
9384                                                          2. s_waitcnt vmcnt(0) &
9385                                                             lgkmcnt(0)
9386
9387                                                            - If TgSplit execution mode,
9388                                                              omit lgkmcnt(0).
9389                                                            - If OpenCL, omit
9390                                                              lgkmcnt(0).
9391                                                            - Must happen before
9392                                                              following
9393                                                              buffer_inv.
9394                                                            - Ensures the
9395                                                              atomicrmw has
9396                                                              completed before
9397                                                              invalidating the
9398                                                              cache.
9399
9400                                                          3. buffer_inv sc1=1
9401
9402                                                            - Must happen before
9403                                                              any following
9404                                                              global/generic
9405                                                              load/load
9406                                                              atomic/atomicrmw.
9407                                                            - Ensures that
9408                                                              following loads
9409                                                              will not see stale
9410                                                              global data.
9411
9412      atomicrmw    acquire      - system       - generic  1. flat_atomic sc1=1
9413                                                          2. s_waitcnt vmcnt(0) &
9414                                                             lgkmcnt(0)
9415
9416                                                            - If TgSplit execution mode,
9417                                                              omit lgkmcnt(0).
9418                                                            - If OpenCL, omit
9419                                                              lgkmcnt(0).
9420                                                            - Must happen before
9421                                                              following
9422                                                              buffer_inv.
9423                                                            - Ensures the
9424                                                              atomicrmw has
9425                                                              completed before
9426                                                              invalidating the
9427                                                              caches.
9428
9429                                                          3. buffer_inv sc0=1 sc1=1
9430
9431                                                            - Must happen before
9432                                                              any following
9433                                                              global/generic
9434                                                              load/load
9435                                                              atomic/atomicrmw.
9436                                                            - Ensures that
9437                                                              following
9438                                                              loads will not see
9439                                                              stale MTYPE NC global data.
9440                                                              MTYPE RW and CC memory will
9441                                                              never be stale due to the
9442                                                              memory probes.
9443
9444      fence        acquire      - singlethread *none*     *none*
9445                                - wavefront
9446      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9447
9448                                                            - Use lgkmcnt(0) if not
9449                                                              TgSplit execution mode
9450                                                              and vmcnt(0) if TgSplit
9451                                                              execution mode.
9452                                                            - If OpenCL and
9453                                                              address space is
9454                                                              not generic, omit
9455                                                              lgkmcnt(0).
9456                                                            - If OpenCL and
9457                                                              address space is
9458                                                              local, omit
9459                                                              vmcnt(0).
9460                                                            - However, since LLVM
9461                                                              currently has no
9462                                                              address space on
9463                                                              the fence, the wait
9464                                                              must conservatively
9465                                                              always be generated.
9466                                                              If the fence had an
9467                                                              address space, it
9468                                                              would be set to the
9469                                                              address space of the
9470                                                              OpenCL fence flag,
9471                                                              or to generic if
9472                                                              both local and
9473                                                              global flags are
9474                                                              specified.
9475                                                            - s_waitcnt vmcnt(0)
9476                                                              must happen after
9477                                                              any
9478                                                              preceding
9479                                                              global/generic load
9480                                                              atomic/atomicrmw
9481                                                              with an equal or
9482                                                              wider sync scope
9483                                                              and memory ordering
9484                                                              stronger than
9485                                                              unordered (this is
9486                                                              termed the
9487                                                              fence-paired-atomic).
9488                                                            - s_waitcnt lgkmcnt(0)
9489                                                              must happen after
9490                                                              any preceding
9491                                                              local/generic load
9492                                                              atomic/atomicrmw
9493                                                              with an equal or
9494                                                              wider sync scope
9495                                                              and memory ordering
9496                                                              stronger than
9497                                                              unordered (this is
9498                                                              termed the
9499                                                              fence-paired-atomic).
9500                                                            - Must happen before
9501                                                              the following
9502                                                              buffer_inv and
9503                                                              any following
9504                                                              global/generic
9505                                                              load/load
9506                                                              atomic/store/store
9507                                                              atomic/atomicrmw.
9508                                                            - Ensures any
9509                                                              following global
9510                                                              data read is no
9511                                                              older than the
9512                                                              value read by the
9513                                                              fence-paired-atomic.
9514
9515                                                          3. buffer_inv sc0=1
9516
9517                                                            - If not TgSplit execution
9518                                                              mode, omit.
9519                                                            - Ensures that
9520                                                              following
9521                                                              loads will not see
9522                                                              stale data.
9523
9524      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9525                                                             vmcnt(0)
9526
9527                                                            - If TgSplit execution mode,
9528                                                              omit lgkmcnt(0).
9529                                                            - If OpenCL and
9530                                                              address space is
9531                                                              not generic, omit
9532                                                              lgkmcnt(0).
9533                                                            - However, since LLVM
9534                                                              currently has no
9535                                                              address space on
9536                                                              the fence, the waits
9537                                                              must conservatively
9538                                                              always be generated
9539                                                              (see comment for
9540                                                              previous fence).
9541                                                            - Could be split into
9542                                                              separate s_waitcnt
9543                                                              vmcnt(0) and
9544                                                              s_waitcnt
9545                                                              lgkmcnt(0) to allow
9546                                                              them to be
9547                                                              independently moved
9548                                                              according to the
9549                                                              following rules.
9550                                                            - s_waitcnt vmcnt(0)
9551                                                              must happen after
9552                                                              any preceding
9553                                                              global/generic load
9554                                                              atomic/atomicrmw
9555                                                              with an equal or
9556                                                              wider sync scope
9557                                                              and memory ordering
9558                                                              stronger than
9559                                                              unordered (this is
9560                                                              termed the
9561                                                              fence-paired-atomic).
9562                                                            - s_waitcnt lgkmcnt(0)
9563                                                              must happen after
9564                                                              any preceding
9565                                                              local/generic load
9566                                                              atomic/atomicrmw
9567                                                              with an equal or
9568                                                              wider sync scope
9569                                                              and memory ordering
9570                                                              stronger than
9571                                                              unordered (this is
9572                                                              termed the
9573                                                              fence-paired-atomic).
9574                                                            - Must happen before
9575                                                              the following
9576                                                              buffer_inv.
9577                                                            - Ensures that the
9578                                                              fence-paired-atomic
9579                                                              has completed
9580                                                              before invalidating
9581                                                              the
9582                                                              cache. Therefore
9583                                                              any following
9584                                                              locations read must
9585                                                              be no older than
9586                                                              the value read by
9587                                                              the
9588                                                              fence-paired-atomic.
9589
9590                                                          2. buffer_inv sc1=1
9591
9592                                                            - Must happen before any
9593                                                              following global/generic
9594                                                              load/load
9595                                                              atomic/store/store
9596                                                              atomic/atomicrmw.
9597                                                            - Ensures that
9598                                                              following loads
9599                                                              will not see stale
9600                                                              global data.
9601
9602      fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
9603                                                             vmcnt(0)
9604
9605                                                            - If TgSplit execution mode,
9606                                                              omit lgkmcnt(0).
9607                                                            - If OpenCL and
9608                                                              address space is
9609                                                              not generic, omit
9610                                                              lgkmcnt(0).
9611                                                            - However, since LLVM
9612                                                              currently has no
9613                                                              address space on
9614                                                              the fence, the waits
9615                                                              must conservatively
9616                                                              always be generated
9617                                                              (see comment for
9618                                                              previous fence).
9619                                                            - Could be split into
9620                                                              separate s_waitcnt
9621                                                              vmcnt(0) and
9622                                                              s_waitcnt
9623                                                              lgkmcnt(0) to allow
9624                                                              them to be
9625                                                              independently moved
9626                                                              according to the
9627                                                              following rules.
9628                                                            - s_waitcnt vmcnt(0)
9629                                                              must happen after
9630                                                              any preceding
9631                                                              global/generic load
9632                                                              atomic/atomicrmw
9633                                                              with an equal or
9634                                                              wider sync scope
9635                                                              and memory ordering
9636                                                              stronger than
9637                                                              unordered (this is
9638                                                              termed the
9639                                                              fence-paired-atomic).
9640                                                            - s_waitcnt lgkmcnt(0)
9641                                                              must happen after
9642                                                              any preceding
9643                                                              local/generic load
9644                                                              atomic/atomicrmw
9645                                                              with an equal or
9646                                                              wider sync scope
9647                                                              and memory ordering
9648                                                              stronger than
9649                                                              unordered (this is
9650                                                              termed the
9651                                                              fence-paired-atomic).
9652                                                            - Must happen before
9653                                                              the following
9654                                                              buffer_inv.
9655                                                            - Ensures that the
9656                                                              fence-paired-atomic
9657                                                              has completed
9658                                                              before invalidating
9659                                                              the
9660                                                              cache. Therefore
9661                                                              any following
9662                                                              locations read must
9663                                                              be no older than
9664                                                              the value read by
9665                                                              the
9666                                                              fence-paired-atomic.
9667
9668                                                          2. buffer_inv sc0=1 sc1=1
9669
9670                                                            - Must happen before any
9671                                                              following global/generic
9672                                                              load/load
9673                                                              atomic/store/store
9674                                                              atomic/atomicrmw.
9675                                                            - Ensures that
9676                                                              following loads
9677                                                              will not see stale
9678                                                              global data.
9679
9680      **Release Atomic**
9681      ------------------------------------------------------------------------------------
9682      store atomic release      - singlethread - global   1. buffer/global/flat_store
9683                                - wavefront    - generic
9684      store atomic release      - singlethread - local    *If TgSplit execution mode,
9685                                - wavefront               local address space cannot
9686                                                          be used.*
9687
9688                                                          1. ds_store
9689      store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9690                                               - generic
9691                                                            - Use lgkmcnt(0) if not
9692                                                              TgSplit execution mode
9693                                                              and vmcnt(0) if TgSplit
9694                                                              execution mode.
9695                                                            - If OpenCL, omit lgkmcnt(0).
9696                                                            - s_waitcnt vmcnt(0)
9697                                                              must happen after
9698                                                              any preceding
9699                                                              global/generic load/store/load
9700                                                              atomic/store
9701                                                              atomic/atomicrmw.
9702                                                            - s_waitcnt lgkmcnt(0)
9703                                                              must happen after
9704                                                              any preceding
9705                                                              local/generic
9706                                                              load/store/load
9707                                                              atomic/store
9708                                                              atomic/atomicrmw.
9709                                                            - Must happen before
9710                                                              the following
9711                                                              store.
9712                                                            - Ensures that all
9713                                                              memory operations
9714                                                              have
9715                                                              completed before
9716                                                              performing the
9717                                                              store that is being
9718                                                              released.
9719
9720                                                          2. buffer/global/flat_store sc0=1
9721      store atomic release      - workgroup    - local    *If TgSplit execution mode,
9722                                                          local address space cannot
9723                                                          be used.*
9724
9725                                                          1. ds_store
9726      store atomic release      - agent        - global   1. buffer_wbl2 sc1=1
9727                                               - generic
9728                                                            - Must happen before
9729                                                              following s_waitcnt.
9730                                                            - Performs L2 writeback to
9731                                                              ensure previous
9732                                                              global/generic
9733                                                              store/atomicrmw are
9734                                                              visible at agent scope.
9735
9736                                                          2. s_waitcnt lgkmcnt(0) &
9737                                                             vmcnt(0)
9738
9739                                                            - If TgSplit execution mode,
9740                                                              omit lgkmcnt(0).
9741                                                            - If OpenCL and
9742                                                              address space is
9743                                                              not generic, omit
9744                                                              lgkmcnt(0).
9745                                                            - Could be split into
9746                                                              separate s_waitcnt
9747                                                              vmcnt(0) and
9748                                                              s_waitcnt
9749                                                              lgkmcnt(0) to allow
9750                                                              them to be
9751                                                              independently moved
9752                                                              according to the
9753                                                              following rules.
9754                                                            - s_waitcnt vmcnt(0)
9755                                                              must happen after
9756                                                              any preceding
9757                                                              global/generic
9758                                                              load/store/load
9759                                                              atomic/store
9760                                                              atomic/atomicrmw.
9761                                                            - s_waitcnt lgkmcnt(0)
9762                                                              must happen after
9763                                                              any preceding
9764                                                              local/generic
9765                                                              load/store/load
9766                                                              atomic/store
9767                                                              atomic/atomicrmw.
9768                                                            - Must happen before
9769                                                              the following
9770                                                              store.
9771                                                            - Ensures that all
9772                                                              memory operations
9773                                                              have
9774                                                              completed before
9775                                                              performing the
9776                                                              store that is being
9777                                                              released.
9778
9779                                                          3. buffer/global/flat_store sc1=1
9780      store atomic release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9781                                               - generic
9782                                                            - Must happen before
9783                                                              following s_waitcnt.
9784                                                            - Performs L2 writeback to
9785                                                              ensure previous
9786                                                              global/generic
9787                                                              store/atomicrmw are
9788                                                              visible at system scope.
9789
9790                                                          2. s_waitcnt lgkmcnt(0) &
9791                                                             vmcnt(0)
9792
9793                                                            - If TgSplit execution mode,
9794                                                              omit lgkmcnt(0).
9795                                                            - If OpenCL and
9796                                                              address space is
9797                                                              not generic, omit
9798                                                              lgkmcnt(0).
9799                                                            - Could be split into
9800                                                              separate s_waitcnt
9801                                                              vmcnt(0) and
9802                                                              s_waitcnt
9803                                                              lgkmcnt(0) to allow
9804                                                              them to be
9805                                                              independently moved
9806                                                              according to the
9807                                                              following rules.
9808                                                            - s_waitcnt vmcnt(0)
9809                                                              must happen after any
9810                                                              preceding
9811                                                              global/generic
9812                                                              load/store/load
9813                                                              atomic/store
9814                                                              atomic/atomicrmw.
9815                                                            - s_waitcnt lgkmcnt(0)
9816                                                              must happen after any
9817                                                              preceding
9818                                                              local/generic
9819                                                              load/store/load
9820                                                              atomic/store
9821                                                              atomic/atomicrmw.
9822                                                            - Must happen before
9823                                                              the following
9824                                                              store.
9825                                                            - Ensures that all
9826                                                              memory operations
9827                                                              and the L2
9828                                                              writeback have
9829                                                              completed before
9830                                                              performing the
9831                                                              store that is being
9832                                                              released.
9833
9834                                                          3. buffer/global/flat_store
9835                                                             sc0=1 sc1=1
9836      atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
9837                                - wavefront    - generic
9838      atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
9839                                - wavefront               local address space cannot
9840                                                          be used.*
9841
9842                                                          1. ds_atomic
9843      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
9844                                               - generic
9845                                                            - Use lgkmcnt(0) if not
9846                                                              TgSplit execution mode
9847                                                              and vmcnt(0) if TgSplit
9848                                                              execution mode.
9849                                                            - If OpenCL, omit
9850                                                              lgkmcnt(0).
9851                                                            - s_waitcnt vmcnt(0)
9852                                                              must happen after
9853                                                              any preceding
9854                                                              global/generic load/store/load
9855                                                              atomic/store
9856                                                              atomic/atomicrmw.
9857                                                            - s_waitcnt lgkmcnt(0)
9858                                                              must happen after
9859                                                              any preceding
9860                                                              local/generic
9861                                                              load/store/load
9862                                                              atomic/store
9863                                                              atomic/atomicrmw.
9864                                                            - Must happen before
9865                                                              the following
9866                                                              atomicrmw.
9867                                                            - Ensures that all
9868                                                              memory operations
9869                                                              have
9870                                                              completed before
9871                                                              performing the
9872                                                              atomicrmw that is
9873                                                              being released.
9874
9875                                                          2. buffer/global/flat_atomic sc0=1
9876      atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
9877                                                          local address space cannot
9878                                                          be used.*
9879
9880                                                          1. ds_atomic
9881      atomicrmw    release      - agent        - global   1. buffer_wbl2 sc1=1
9882                                               - generic
9883                                                            - Must happen before
9884                                                              following s_waitcnt.
9885                                                            - Performs L2 writeback to
9886                                                              ensure previous
9887                                                              global/generic
9888                                                              store/atomicrmw are
9889                                                              visible at agent scope.
9890
9891                                                          2. s_waitcnt lgkmcnt(0) &
9892                                                             vmcnt(0)
9893
9894                                                            - If TgSplit execution mode,
9895                                                              omit lgkmcnt(0).
9896                                                            - If OpenCL, omit
9897                                                              lgkmcnt(0).
9898                                                            - Could be split into
9899                                                              separate s_waitcnt
9900                                                              vmcnt(0) and
9901                                                              s_waitcnt
9902                                                              lgkmcnt(0) to allow
9903                                                              them to be
9904                                                              independently moved
9905                                                              according to the
9906                                                              following rules.
9907                                                            - s_waitcnt vmcnt(0)
9908                                                              must happen after
9909                                                              any preceding
9910                                                              global/generic
9911                                                              load/store/load
9912                                                              atomic/store
9913                                                              atomic/atomicrmw.
9914                                                            - s_waitcnt lgkmcnt(0)
9915                                                              must happen after
9916                                                              any preceding
9917                                                              local/generic
9918                                                              load/store/load
9919                                                              atomic/store
9920                                                              atomic/atomicrmw.
9921                                                            - Must happen before
9922                                                              the following
9923                                                              atomicrmw.
9924                                                            - Ensures that all
9925                                                              memory operations
9926                                                              to global and local
9927                                                              have completed
9928                                                              before performing
9929                                                              the atomicrmw that
9930                                                              is being released.
9931
9932                                                          3. buffer/global/flat_atomic sc1=1
9933      atomicrmw    release      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
9934                                               - generic
9935                                                            - Must happen before
9936                                                              following s_waitcnt.
9937                                                            - Performs L2 writeback to
9938                                                              ensure previous
9939                                                              global/generic
9940                                                              store/atomicrmw are
9941                                                              visible at system scope.
9942
9943                                                          2. s_waitcnt lgkmcnt(0) &
9944                                                             vmcnt(0)
9945
9946                                                            - If TgSplit execution mode,
9947                                                              omit lgkmcnt(0).
9948                                                            - If OpenCL, omit
9949                                                              lgkmcnt(0).
9950                                                            - Could be split into
9951                                                              separate s_waitcnt
9952                                                              vmcnt(0) and
9953                                                              s_waitcnt
9954                                                              lgkmcnt(0) to allow
9955                                                              them to be
9956                                                              independently moved
9957                                                              according to the
9958                                                              following rules.
9959                                                            - s_waitcnt vmcnt(0)
9960                                                              must happen after
9961                                                              any preceding
9962                                                              global/generic
9963                                                              load/store/load
9964                                                              atomic/store
9965                                                              atomic/atomicrmw.
9966                                                            - s_waitcnt lgkmcnt(0)
9967                                                              must happen after
9968                                                              any preceding
9969                                                              local/generic
9970                                                              load/store/load
9971                                                              atomic/store
9972                                                              atomic/atomicrmw.
9973                                                            - Must happen before
9974                                                              the following
9975                                                              atomicrmw.
9976                                                            - Ensures that all
9977                                                              memory operations
9978                                                              to memory and the L2
9979                                                              writeback have
9980                                                              completed before
9981                                                              performing the
9982                                                              atomicrmw that is
9983                                                              being released.
9984
9985                                                          3. buffer/global/flat_atomic
9986                                                             sc0=1 sc1=1
9987      fence        release      - singlethread *none*     *none*
9988                                - wavefront
9989      fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
9990
9991                                                            - Use lgkmcnt(0) if not
9992                                                              TgSplit execution mode
9993                                                              and vmcnt(0) if TgSplit
9994                                                              execution mode.
9995                                                            - If OpenCL and
9996                                                              address space is
9997                                                              not generic, omit
9998                                                              lgkmcnt(0).
9999                                                            - If OpenCL and
10000                                                              address space is
10001                                                              local, omit
10002                                                              vmcnt(0).
10003                                                            - However, since LLVM
10004                                                              currently has no
10005                                                              address space on
10006                                                              the fence, the
10007                                                              s_waitcnt must
10008                                                              conservatively always
10009                                                              be generated. If the
10010                                                              fence had an address
10011                                                              space, it would be
10012                                                              set to the address
10013                                                              space of the OpenCL
10014                                                              fence flag, or to
10015                                                              generic if both
10016                                                              local and global
10017                                                              flags are specified.
10018                                                            - s_waitcnt vmcnt(0)
10019                                                              must happen after
10020                                                              any preceding
10021                                                              global/generic
10022                                                              load/store/
10023                                                              load atomic/store atomic/
10024                                                              atomicrmw.
10025                                                            - s_waitcnt lgkmcnt(0)
10026                                                              must happen after
10027                                                              any preceding
10028                                                              local/generic
10029                                                              load/load
10030                                                              atomic/store/store
10031                                                              atomic/atomicrmw.
10032                                                            - Must happen before
10033                                                              any following store
10034                                                              atomic/atomicrmw
10035                                                              with an equal or
10036                                                              wider sync scope
10037                                                              and memory ordering
10038                                                              stronger than
10039                                                              unordered (this is
10040                                                              termed the
10041                                                              fence-paired-atomic).
10042                                                            - Ensures that all
10043                                                              memory operations
10044                                                              have
10045                                                              completed before
10046                                                              performing the
10047                                                              following
10048                                                              fence-paired-atomic.
10049
10050      fence        release      - agent        *none*     1. buffer_wbl2 sc1=1
10051
10052                                                            - If OpenCL and
10053                                                              address space is
10054                                                              local, omit.
10055                                                            - Must happen before
10056                                                              following s_waitcnt.
10057                                                            - Performs L2 writeback to
10058                                                              ensure previous
10059                                                              global/generic
10060                                                              store/atomicrmw are
10061                                                              visible at agent scope.
10062
10063                                                          2. s_waitcnt lgkmcnt(0) &
10064                                                             vmcnt(0)
10065
10066                                                            - If TgSplit execution mode,
10067                                                              omit lgkmcnt(0).
10068                                                            - If OpenCL and
10069                                                              address space is
10070                                                              not generic, omit
10071                                                              lgkmcnt(0).
10072                                                            - If OpenCL and
10073                                                              address space is
10074                                                              local, omit
10075                                                              vmcnt(0).
10076                                                            - However, since LLVM
10077                                                              currently has no
10078                                                              address space on
10079                                                              the fence, the
10080                                                              s_waitcnt must
10081                                                              conservatively always
10082                                                              be generated. If the
10083                                                              fence had an address
10084                                                              space, it would be
10085                                                              set to the address
10086                                                              space of the OpenCL
10087                                                              fence flag, or to
10088                                                              generic if both
10089                                                              local and global
10090                                                              flags are specified.
10091                                                            - Could be split into
10092                                                              separate s_waitcnt
10093                                                              vmcnt(0) and
10094                                                              s_waitcnt
10095                                                              lgkmcnt(0) to allow
10096                                                              them to be
10097                                                              independently moved
10098                                                              according to the
10099                                                              following rules.
10100                                                            - s_waitcnt vmcnt(0)
10101                                                              must happen after
10102                                                              any preceding
10103                                                              global/generic
10104                                                              load/store/load
10105                                                              atomic/store
10106                                                              atomic/atomicrmw.
10107                                                            - s_waitcnt lgkmcnt(0)
10108                                                              must happen after
10109                                                              any preceding
10110                                                              local/generic
10111                                                              load/store/load
10112                                                              atomic/store
10113                                                              atomic/atomicrmw.
10114                                                            - Must happen before
10115                                                              any following store
10116                                                              atomic/atomicrmw
10117                                                              with an equal or
10118                                                              wider sync scope
10119                                                              and memory ordering
10120                                                              stronger than
10121                                                              unordered (this is
10122                                                              termed the
10123                                                              fence-paired-atomic).
10124                                                            - Ensures that all
10125                                                              memory operations
10126                                                              have
10127                                                              completed before
10128                                                              performing the
10129                                                              following
10130                                                              fence-paired-atomic.
10131
10132      fence        release      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10133
10134                                                            - Must happen before
10135                                                              following s_waitcnt.
10136                                                            - Performs L2 writeback to
10137                                                              ensure previous
10138                                                              global/generic
10139                                                              store/atomicrmw are
10140                                                              visible at system scope.
10141
10142                                                          2. s_waitcnt lgkmcnt(0) &
10143                                                             vmcnt(0)
10144
10145                                                            - If TgSplit execution mode,
10146                                                              omit lgkmcnt(0).
10147                                                            - If OpenCL and
10148                                                              address space is
10149                                                              not generic, omit
10150                                                              lgkmcnt(0).
10151                                                            - If OpenCL and
10152                                                              address space is
10153                                                              local, omit
10154                                                              vmcnt(0).
10155                                                            - However, since LLVM
10156                                                              currently has no
10157                                                              address space on
10158                                                              the fence, the
10159                                                              s_waitcnt must
10160                                                              conservatively always
10161                                                              be generated. If the
10162                                                              fence had an address
10163                                                              space, it would be
10164                                                              set to the address
10165                                                              space of the OpenCL
10166                                                              fence flag, or to
10167                                                              generic if both
10168                                                              local and global
10169                                                              flags are specified.
10170                                                            - Could be split into
10171                                                              separate s_waitcnt
10172                                                              vmcnt(0) and
10173                                                              s_waitcnt
10174                                                              lgkmcnt(0) to allow
10175                                                              them to be
10176                                                              independently moved
10177                                                              according to the
10178                                                              following rules.
10179                                                            - s_waitcnt vmcnt(0)
10180                                                              must happen after
10181                                                              any preceding
10182                                                              global/generic
10183                                                              load/store/load
10184                                                              atomic/store
10185                                                              atomic/atomicrmw.
10186                                                            - s_waitcnt lgkmcnt(0)
10187                                                              must happen after
10188                                                              any preceding
10189                                                              local/generic
10190                                                              load/store/load
10191                                                              atomic/store
10192                                                              atomic/atomicrmw.
10193                                                            - Must happen before
10194                                                              any following store
10195                                                              atomic/atomicrmw
10196                                                              with an equal or
10197                                                              wider sync scope
10198                                                              and memory ordering
10199                                                              stronger than
10200                                                              unordered (this is
10201                                                              termed the
10202                                                              fence-paired-atomic).
10203                                                            - Ensures that all
10204                                                              memory operations
10205                                                              have
10206                                                              completed before
10207                                                              performing the
10208                                                              following
10209                                                              fence-paired-atomic.
10210
10211      **Acquire-Release Atomic**
10212      ------------------------------------------------------------------------------------
10213      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
10214                                - wavefront    - generic
10215      atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
10216                                - wavefront               local address space cannot
10217                                                          be used.*
10218
10219                                                          1. ds_atomic
10220      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
10221
10222                                                            - Use lgkmcnt(0) if not
10223                                                              TgSplit execution mode
10224                                                              and vmcnt(0) if TgSplit
10225                                                              execution mode.
10226                                                            - If OpenCL, omit
10227                                                              lgkmcnt(0).
10234                                                            - s_waitcnt vmcnt(0)
10235                                                              must happen after
10236                                                              any preceding
10237                                                              global/generic load/store/
10238                                                              load atomic/store atomic/
10239                                                              atomicrmw.
10240                                                            - s_waitcnt lgkmcnt(0)
10241                                                              must happen after
10242                                                              any preceding
10243                                                              local/generic
10244                                                              load/store/load
10245                                                              atomic/store
10246                                                              atomic/atomicrmw.
10247                                                            - Must happen before
10248                                                              the following
10249                                                              atomicrmw.
10250                                                            - Ensures that all
10251                                                              memory operations
10252                                                              have
10253                                                              completed before
10254                                                              performing the
10255                                                              atomicrmw that is
10256                                                              being released.
10257
10258                                                          2. buffer/global_atomic
10259                                                          3. s_waitcnt vmcnt(0)
10260
10261                                                            - If not TgSplit execution
10262                                                              mode, omit.
10263                                                            - Must happen before
10264                                                              the following
10265                                                              buffer_inv.
10266                                                            - Ensures any
10267                                                              following global
10268                                                              data read is no
10269                                                              older than the
10270                                                              atomicrmw value
10271                                                              being acquired.
10272
10273                                                          4. buffer_inv sc0=1
10274
10275                                                            - If not TgSplit execution
10276                                                              mode, omit.
10277                                                            - Ensures that
10278                                                              following
10279                                                              loads will not see
10280                                                              stale data.
10281
10282      atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
10283                                                          local address space cannot
10284                                                          be used.*
10285
10286                                                          1. ds_atomic
10287                                                          2. s_waitcnt lgkmcnt(0)
10288
10289                                                            - If OpenCL, omit.
10290                                                            - Must happen before
10291                                                              any following
10292                                                              global/generic
10293                                                              load/load
10294                                                              atomic/store/store
10295                                                              atomic/atomicrmw.
10296                                                            - Ensures any
10297                                                              following global
10298                                                              data read is no
10299                                                              older than the local
10300                                                              atomicrmw value being
10301                                                              acquired.
10302
10303      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
10304
10305                                                            - Use lgkmcnt(0) if not
10306                                                              TgSplit execution mode
10307                                                              and vmcnt(0) if TgSplit
10308                                                              execution mode.
10309                                                            - If OpenCL, omit
10310                                                              lgkmcnt(0).
10311                                                            - s_waitcnt vmcnt(0)
10312                                                              must happen after
10313                                                              any preceding
10314                                                              global/generic load/store/
10315                                                              load atomic/store atomic/
10316                                                              atomicrmw.
10317                                                            - s_waitcnt lgkmcnt(0)
10318                                                              must happen after
10319                                                              any preceding
10320                                                              local/generic
10321                                                              load/store/load
10322                                                              atomic/store
10323                                                              atomic/atomicrmw.
10324                                                            - Must happen before
10325                                                              the following
10326                                                              atomicrmw.
10327                                                            - Ensures that all
10328                                                              memory operations
10329                                                              have
10330                                                              completed before
10331                                                              performing the
10332                                                              atomicrmw that is
10333                                                              being released.
10334
10335                                                          2. flat_atomic
10336                                                          3. s_waitcnt lgkmcnt(0) &
10337                                                             vmcnt(0)
10338
10339                                                            - If not TgSplit execution
10340                                                              mode, omit vmcnt(0).
10341                                                            - If OpenCL, omit
10342                                                              lgkmcnt(0).
10343                                                            - Must happen before
10344                                                              the following
10345                                                              buffer_inv and
10346                                                              any following
10347                                                              global/generic
10348                                                              load/load
10349                                                              atomic/store/store
10350                                                              atomic/atomicrmw.
10351                                                            - Ensures any
10352                                                              following global
10353                                                              data read is no
10354                                                              older than a local
10355                                                              atomicrmw value being
10356                                                              acquired.
10357
10358                                                          4. buffer_inv sc0=1
10359
10360                                                            - If not TgSplit execution
10361                                                              mode, omit.
10362                                                            - Ensures that
10363                                                              following
10364                                                              loads will not see
10365                                                              stale data.
10366
10367      atomicrmw    acq_rel      - agent        - global   1. buffer_wbl2 sc1=1
10368
10369                                                            - Must happen before
10370                                                              following s_waitcnt.
10371                                                            - Performs L2 writeback to
10372                                                              ensure previous
10373                                                              global/generic
10374                                                              store/atomicrmw are
10375                                                              visible at agent scope.
10376
10377                                                          2. s_waitcnt lgkmcnt(0) &
10378                                                             vmcnt(0)
10379
10380                                                            - If TgSplit execution mode,
10381                                                              omit lgkmcnt(0).
10382                                                            - If OpenCL, omit
10383                                                              lgkmcnt(0).
10384                                                            - Could be split into
10385                                                              separate s_waitcnt
10386                                                              vmcnt(0) and
10387                                                              s_waitcnt
10388                                                              lgkmcnt(0) to allow
10389                                                              them to be
10390                                                              independently moved
10391                                                              according to the
10392                                                              following rules.
10393                                                            - s_waitcnt vmcnt(0)
10394                                                              must happen after
10395                                                              any preceding
10396                                                              global/generic
10397                                                              load/store/load
10398                                                              atomic/store
10399                                                              atomic/atomicrmw.
10400                                                            - s_waitcnt lgkmcnt(0)
10401                                                              must happen after
10402                                                              any preceding
10403                                                              local/generic
10404                                                              load/store/load
10405                                                              atomic/store
10406                                                              atomic/atomicrmw.
10407                                                            - Must happen before
10408                                                              the following
10409                                                              atomicrmw.
10410                                                            - Ensures that all
10411                                                              memory operations
10412                                                              to global have
10413                                                              completed before
10414                                                              performing the
10415                                                              atomicrmw that is
10416                                                              being released.
10417
10418                                                          3. buffer/global_atomic
10419                                                          4. s_waitcnt vmcnt(0)
10420
10421                                                            - Must happen before
10422                                                              following
10423                                                              buffer_inv.
10424                                                            - Ensures the
10425                                                              atomicrmw has
10426                                                              completed before
10427                                                              invalidating the
10428                                                              cache.
10429
10430                                                          5. buffer_inv sc1=1
10431
10432                                                            - Must happen before
10433                                                              any following
10434                                                              global/generic
10435                                                              load/load
10436                                                              atomic/atomicrmw.
10437                                                            - Ensures that
10438                                                              following loads
10439                                                              will not see stale
10440                                                              global data.
10441
10442      atomicrmw    acq_rel      - system       - global   1. buffer_wbl2 sc0=1 sc1=1
10443
10444                                                            - Must happen before
10445                                                              following s_waitcnt.
10446                                                            - Performs L2 writeback to
10447                                                              ensure previous
10448                                                              global/generic
10449                                                              store/atomicrmw are
10450                                                              visible at system scope.
10451
10452                                                          2. s_waitcnt lgkmcnt(0) &
10453                                                             vmcnt(0)
10454
10455                                                            - If TgSplit execution mode,
10456                                                              omit lgkmcnt(0).
10457                                                            - If OpenCL, omit
10458                                                              lgkmcnt(0).
10459                                                            - Could be split into
10460                                                              separate s_waitcnt
10461                                                              vmcnt(0) and
10462                                                              s_waitcnt
10463                                                              lgkmcnt(0) to allow
10464                                                              them to be
10465                                                              independently moved
10466                                                              according to the
10467                                                              following rules.
10468                                                            - s_waitcnt vmcnt(0)
10469                                                              must happen after
10470                                                              any preceding
10471                                                              global/generic
10472                                                              load/store/load
10473                                                              atomic/store
10474                                                              atomic/atomicrmw.
10475                                                            - s_waitcnt lgkmcnt(0)
10476                                                              must happen after
10477                                                              any preceding
10478                                                              local/generic
10479                                                              load/store/load
10480                                                              atomic/store
10481                                                              atomic/atomicrmw.
10482                                                            - Must happen before
10483                                                              the following
10484                                                              atomicrmw.
10485                                                            - Ensures that all
10486                                                              memory operations
10487                                                              to global and L2 writeback
10488                                                              have completed before
10489                                                              performing the
10490                                                              atomicrmw that is
10491                                                              being released.
10492
10493                                                          3. buffer/global_atomic
10494                                                             sc1=1
10495                                                          4. s_waitcnt vmcnt(0)
10496
10497                                                            - Must happen before
10498                                                              following
10499                                                              buffer_inv.
10500                                                            - Ensures the
10501                                                              atomicrmw has
10502                                                              completed before
10503                                                              invalidating the
10504                                                              caches.
10505
10506                                                          5. buffer_inv sc0=1 sc1=1
10507
10508                                                            - Must happen before
10509                                                              any following
10510                                                              global/generic
10511                                                              load/load
10512                                                              atomic/atomicrmw.
10513                                                            - Ensures that
10514                                                              following loads
10515                                                              will not see stale
10516                                                              MTYPE NC global data.
10517                                                              MTYPE RW and CC memory will
10518                                                              never be stale due to the
10519                                                              memory probes.
10520
10521      atomicrmw    acq_rel      - agent        - generic  1. buffer_wbl2 sc1=1
10522
10523                                                            - Must happen before
10524                                                              following s_waitcnt.
10525                                                            - Performs L2 writeback to
10526                                                              ensure previous
10527                                                              global/generic
10528                                                              store/atomicrmw are
10529                                                              visible at agent scope.
10530
10531                                                          2. s_waitcnt lgkmcnt(0) &
10532                                                             vmcnt(0)
10533
10534                                                            - If TgSplit execution mode,
10535                                                              omit lgkmcnt(0).
10536                                                            - If OpenCL, omit
10537                                                              lgkmcnt(0).
10538                                                            - Could be split into
10539                                                              separate s_waitcnt
10540                                                              vmcnt(0) and
10541                                                              s_waitcnt
10542                                                              lgkmcnt(0) to allow
10543                                                              them to be
10544                                                              independently moved
10545                                                              according to the
10546                                                              following rules.
10547                                                            - s_waitcnt vmcnt(0)
10548                                                              must happen after
10549                                                              any preceding
10550                                                              global/generic
10551                                                              load/store/load
10552                                                              atomic/store
10553                                                              atomic/atomicrmw.
10554                                                            - s_waitcnt lgkmcnt(0)
10555                                                              must happen after
10556                                                              any preceding
10557                                                              local/generic
10558                                                              load/store/load
10559                                                              atomic/store
10560                                                              atomic/atomicrmw.
10561                                                            - Must happen before
10562                                                              the following
10563                                                              atomicrmw.
10564                                                            - Ensures that all
10565                                                              memory operations
10566                                                              to global have
10567                                                              completed before
10568                                                              performing the
10569                                                              atomicrmw that is
10570                                                              being released.
10571
10572                                                          3. flat_atomic
10573                                                          4. s_waitcnt vmcnt(0) &
10574                                                             lgkmcnt(0)
10575
10576                                                            - If TgSplit execution mode,
10577                                                              omit lgkmcnt(0).
10578                                                            - If OpenCL, omit
10579                                                              lgkmcnt(0).
10580                                                            - Must happen before
10581                                                              following
10582                                                              buffer_inv.
10583                                                            - Ensures the
10584                                                              atomicrmw has
10585                                                              completed before
10586                                                              invalidating the
10587                                                              cache.
10588
10589                                                          5. buffer_inv sc1=1
10590
10591                                                            - Must happen before
10592                                                              any following
10593                                                              global/generic
10594                                                              load/load
10595                                                              atomic/atomicrmw.
10596                                                            - Ensures that
10597                                                              following loads
10598                                                              will not see stale
10599                                                              global data.
10600
10601      atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2 sc0=1 sc1=1
10602
10603                                                            - Must happen before
10604                                                              following s_waitcnt.
10605                                                            - Performs L2 writeback to
10606                                                              ensure previous
10607                                                              global/generic
10608                                                              store/atomicrmw are
10609                                                              visible at system scope.
10610
10611                                                          2. s_waitcnt lgkmcnt(0) &
10612                                                             vmcnt(0)
10613
10614                                                            - If TgSplit execution mode,
10615                                                              omit lgkmcnt(0).
10616                                                            - If OpenCL, omit
10617                                                              lgkmcnt(0).
10618                                                            - Could be split into
10619                                                              separate s_waitcnt
10620                                                              vmcnt(0) and
10621                                                              s_waitcnt
10622                                                              lgkmcnt(0) to allow
10623                                                              them to be
10624                                                              independently moved
10625                                                              according to the
10626                                                              following rules.
10627                                                            - s_waitcnt vmcnt(0)
10628                                                              must happen after
10629                                                              any preceding
10630                                                              global/generic
10631                                                              load/store/load
10632                                                              atomic/store
10633                                                              atomic/atomicrmw.
10634                                                            - s_waitcnt lgkmcnt(0)
10635                                                              must happen after
10636                                                              any preceding
10637                                                              local/generic
10638                                                              load/store/load
10639                                                              atomic/store
10640                                                              atomic/atomicrmw.
10641                                                            - Must happen before
10642                                                              the following
10643                                                              atomicrmw.
10644                                                            - Ensures that all
10645                                                              memory operations
10646                                                              to global and L2 writeback
10647                                                              have completed before
10648                                                              performing the
10649                                                              atomicrmw that is
10650                                                              being released.
10651
10652                                                          3. flat_atomic sc1=1
10653                                                          4. s_waitcnt vmcnt(0) &
10654                                                             lgkmcnt(0)
10655
10656                                                            - If TgSplit execution mode,
10657                                                              omit lgkmcnt(0).
10658                                                            - If OpenCL, omit
10659                                                              lgkmcnt(0).
10660                                                            - Must happen before
10661                                                              following
10662                                                              buffer_inv.
10663                                                            - Ensures the
10664                                                              atomicrmw has
10665                                                              completed before
10666                                                              invalidating the
10667                                                              caches.
10668
10669                                                          5. buffer_inv sc0=1 sc1=1
10670
10671                                                            - Must happen before
10672                                                              any following
10673                                                              global/generic
10674                                                              load/load
10675                                                              atomic/atomicrmw.
10676                                                            - Ensures that
10677                                                              following loads
10678                                                              will not see stale
10679                                                              MTYPE NC global data.
10680                                                              MTYPE RW and CC memory will
10681                                                              never be stale due to the
10682                                                              memory probes.
10683
10684      fence        acq_rel      - singlethread *none*     *none*
10685                                - wavefront
10686      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
10687
10688                                                            - Use lgkmcnt(0) if not
10689                                                              TgSplit execution mode
10690                                                              and vmcnt(0) if TgSplit
10691                                                              execution mode.
10692                                                            - If OpenCL and
10693                                                              address space is
10694                                                              not generic, omit
10695                                                              lgkmcnt(0).
10696                                                            - If OpenCL and
10697                                                              address space is
10698                                                              local, omit
10699                                                              vmcnt(0).
10700                                                            - However,
10701                                                              since LLVM
10702                                                              currently has no
10703                                                              address space on
10704                                                              the fence, the waits
10705                                                              must conservatively
10706                                                              always be generated
10707                                                              (see comment for
10708                                                              previous fence).
10709                                                            - s_waitcnt vmcnt(0)
10710                                                              must happen after
10711                                                              any preceding
10712                                                              global/generic
10713                                                              load/store/
10714                                                              load atomic/store atomic/
10715                                                              atomicrmw.
10716                                                            - s_waitcnt lgkmcnt(0)
10717                                                              must happen after
10718                                                              any preceding
10719                                                              local/generic
10720                                                              load/load
10721                                                              atomic/store/store
10722                                                              atomic/atomicrmw.
10723                                                            - Must happen before
10724                                                              any following
10725                                                              global/generic
10726                                                              load/load
10727                                                              atomic/store/store
10728                                                              atomic/atomicrmw.
10729                                                            - Ensures that all
10730                                                              memory operations
10731                                                              have
10732                                                              completed before
10733                                                              performing any
10734                                                              following global
10735                                                              memory operations.
10736                                                            - Ensures that the
10737                                                              preceding
10738                                                              local/generic load
10739                                                              atomic/atomicrmw
10740                                                              with an equal or
10741                                                              wider sync scope
10742                                                              and memory ordering
10743                                                              stronger than
10744                                                              unordered (this is
10745                                                              termed the
10746                                                              acquire-fence-paired-atomic)
10747                                                              has completed
10748                                                              before following
10749                                                              global memory
10750                                                              operations. This
10751                                                              satisfies the
10752                                                              requirements of
10753                                                              acquire.
10754                                                            - Ensures that all
10755                                                              previous memory
10756                                                              operations have
10757                                                              completed before a
10758                                                              following
10759                                                              local/generic store
10760                                                              atomic/atomicrmw
10761                                                              with an equal or
10762                                                              wider sync scope
10763                                                              and memory ordering
10764                                                              stronger than
10765                                                              unordered (this is
10766                                                              termed the
10767                                                              release-fence-paired-atomic).
10768                                                              This satisfies the
10769                                                              requirements of
10770                                                              release.
10771                                                            - Must happen before
10772                                                              the following
10773                                                              buffer_inv.
10774                                                            - Ensures that the
10775                                                              acquire-fence-paired
10776                                                              atomic has completed
10777                                                              before invalidating
10778                                                              the
10779                                                              cache. Therefore
10780                                                              any following
10781                                                              locations read must
10782                                                              be no older than
10783                                                              the value read by
10784                                                              the
10785                                                              acquire-fence-paired-atomic.
10786
10787                                                          3. buffer_inv sc0=1
10788
10789                                                            - If not TgSplit execution
10790                                                              mode, omit.
10791                                                            - Ensures that
10792                                                              following
10793                                                              loads will not see
10794                                                              stale data.
10795
10796      fence        acq_rel      - agent        *none*     1. buffer_wbl2 sc1=1
10797
10798                                                            - If OpenCL and
10799                                                              address space is
10800                                                              local, omit.
10801                                                            - Must happen before
10802                                                              following s_waitcnt.
10803                                                            - Performs L2 writeback to
10804                                                              ensure previous
10805                                                              global/generic
10806                                                              store/atomicrmw are
10807                                                              visible at agent scope.
10808
10809                                                          2. s_waitcnt lgkmcnt(0) &
10810                                                             vmcnt(0)
10811
10812                                                            - If TgSplit execution mode,
10813                                                              omit lgkmcnt(0).
10814                                                            - If OpenCL and
10815                                                              address space is
10816                                                              not generic, omit
10817                                                              lgkmcnt(0).
10818                                                            - However, since LLVM
10819                                                              currently has no
10820                                                              address space on
10821                                                              the fence, it must
10822                                                              conservatively
10823                                                              always be generated
10824                                                              (see comment for
10825                                                              previous fence).
10826                                                            - Could be split into
10827                                                              separate s_waitcnt
10828                                                              vmcnt(0) and
10829                                                              s_waitcnt
10830                                                              lgkmcnt(0) to allow
10831                                                              them to be
10832                                                              independently moved
10833                                                              according to the
10834                                                              following rules.
10835                                                            - s_waitcnt vmcnt(0)
10836                                                              must happen after
10837                                                              any preceding
10838                                                              global/generic
10839                                                              load/store/load
10840                                                              atomic/store
10841                                                              atomic/atomicrmw.
10842                                                            - s_waitcnt lgkmcnt(0)
10843                                                              must happen after
10844                                                              any preceding
10845                                                              local/generic
10846                                                              load/store/load
10847                                                              atomic/store
10848                                                              atomic/atomicrmw.
10849                                                            - Must happen before
10850                                                              the following
10851                                                              buffer_inv.
10852                                                            - Ensures that the
10853                                                              preceding
10854                                                              global/local/generic
10855                                                              load
10856                                                              atomic/atomicrmw
10857                                                              with an equal or
10858                                                              wider sync scope
10859                                                              and memory ordering
10860                                                              stronger than
10861                                                              unordered (this is
10862                                                              termed the
10863                                                              acquire-fence-paired-atomic)
10864                                                              has completed
10865                                                              before invalidating
10866                                                              the cache. This
10867                                                              satisfies the
10868                                                              requirements of
10869                                                              acquire.
10870                                                            - Ensures that all
10871                                                              previous memory
10872                                                              operations have
10873                                                              completed before a
10874                                                              following
10875                                                              global/local/generic
10876                                                              store
10877                                                              atomic/atomicrmw
10878                                                              with an equal or
10879                                                              wider sync scope
10880                                                              and memory ordering
10881                                                              stronger than
10882                                                              unordered (this is
10883                                                              termed the
10884                                                              release-fence-paired-atomic).
10885                                                              This satisfies the
10886                                                              requirements of
10887                                                              release.
10888
10889                                                          3. buffer_inv sc1=1
10890
10891                                                            - Must happen before
10892                                                              any following
10893                                                              global/generic
10894                                                              load/load
10895                                                              atomic/store/store
10896                                                              atomic/atomicrmw.
10897                                                            - Ensures that
10898                                                              following loads
10899                                                              will not see stale
10900                                                              global data. This
10901                                                              satisfies the
10902                                                              requirements of
10903                                                              acquire.
10904
10905      fence        acq_rel      - system       *none*     1. buffer_wbl2 sc0=1 sc1=1
10906
10907                                                            - If OpenCL and
10908                                                              address space is
10909                                                              local, omit.
10910                                                            - Must happen before
10911                                                              following s_waitcnt.
10912                                                            - Performs L2 writeback to
10913                                                              ensure previous
10914                                                              global/generic
10915                                                              store/atomicrmw are
10916                                                              visible at system scope.
10917
10918                                                          2. s_waitcnt lgkmcnt(0) &
10919                                                             vmcnt(0)
10920
10921                                                            - If TgSplit execution mode,
10922                                                              omit lgkmcnt(0).
10923                                                            - If OpenCL and
10924                                                              address space is
10925                                                              not generic, omit
10926                                                              lgkmcnt(0).
10927                                                            - However, since LLVM
10928                                                              currently has no
10929                                                              address space on
10930                                                              the fence, it must
10931                                                              conservatively
10932                                                              always be generated
10933                                                              (see comment for
10934                                                              previous fence).
10935                                                            - Could be split into
10936                                                              separate s_waitcnt
10937                                                              vmcnt(0) and
10938                                                              s_waitcnt
10939                                                              lgkmcnt(0) to allow
10940                                                              them to be
10941                                                              independently moved
10942                                                              according to the
10943                                                              following rules.
10944                                                            - s_waitcnt vmcnt(0)
10945                                                              must happen after
10946                                                              any preceding
10947                                                              global/generic
10948                                                              load/store/load
10949                                                              atomic/store
10950                                                              atomic/atomicrmw.
10951                                                            - s_waitcnt lgkmcnt(0)
10952                                                              must happen after
10953                                                              any preceding
10954                                                              local/generic
10955                                                              load/store/load
10956                                                              atomic/store
10957                                                              atomic/atomicrmw.
10958                                                            - Must happen before
10959                                                              the following
10960                                                              buffer_inv.
10961                                                            - Ensures that the
10962                                                              preceding
10963                                                              global/local/generic
10964                                                              load
10965                                                              atomic/atomicrmw
10966                                                              with an equal or
10967                                                              wider sync scope
10968                                                              and memory ordering
10969                                                              stronger than
10970                                                              unordered (this is
10971                                                              termed the
10972                                                              acquire-fence-paired-atomic)
10973                                                              has completed
10974                                                              before invalidating
10975                                                              the cache. This
10976                                                              satisfies the
10977                                                              requirements of
10978                                                              acquire.
10979                                                            - Ensures that all
10980                                                              previous memory
10981                                                              operations have
10982                                                              completed before a
10983                                                              following
10984                                                              global/local/generic
10985                                                              store
10986                                                              atomic/atomicrmw
10987                                                              with an equal or
10988                                                              wider sync scope
10989                                                              and memory ordering
10990                                                              stronger than
10991                                                              unordered (this is
10992                                                              termed the
10993                                                              release-fence-paired-atomic).
10994                                                              This satisfies the
10995                                                              requirements of
10996                                                              release.
10997
10998                                                          3. buffer_inv sc0=1 sc1=1
10999
11000                                                            - Must happen before
11001                                                              any following
11002                                                              global/generic
11003                                                              load/load
11004                                                              atomic/store/store
11005                                                              atomic/atomicrmw.
11006                                                            - Ensures that
11007                                                              following loads
11008                                                              will not see stale
11009                                                              MTYPE NC global data.
11010                                                              MTYPE RW and CC memory will
11011                                                              never be stale due to the
11012                                                              memory probes.
11013
11014      **Sequential Consistent Atomic**
11015      ------------------------------------------------------------------------------------
11016      load atomic  seq_cst      - singlethread - global   *Same as corresponding
11017                                - wavefront    - local    load atomic acquire,
11018                                               - generic  except must generate
11019                                                          all instructions even
11020                                                          for OpenCL.*
11021      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
11022                                               - generic
11023                                                            - Use lgkmcnt(0) if not
11024                                                              TgSplit execution mode
11025                                                              and vmcnt(0) if TgSplit
11026                                                              execution mode.
11027                                                            - s_waitcnt lgkmcnt(0) must
11028                                                              happen after
11029                                                              preceding
11030                                                              local/generic load
11031                                                              atomic/store
11032                                                              atomic/atomicrmw
11033                                                              with memory
11034                                                              ordering of seq_cst
11035                                                              and with equal or
11036                                                              wider sync scope.
11037                                                              (Note that seq_cst
11038                                                              fences have their
11039                                                              own s_waitcnt
11040                                                              lgkmcnt(0) and so do
11041                                                              not need to be
11042                                                              considered.)
11043                                                            - s_waitcnt vmcnt(0)
11044                                                              must happen after
11045                                                              preceding
11046                                                              global/generic load
11047                                                              atomic/store
11048                                                              atomic/atomicrmw
11049                                                              with memory
11050                                                              ordering of seq_cst
11051                                                              and with equal or
11052                                                              wider sync scope.
11053                                                              (Note that seq_cst
11054                                                              fences have their
11055                                                              own s_waitcnt
11056                                                              vmcnt(0) and so do
11057                                                              not need to be
11058                                                              considered.)
11059                                                            - Ensures any
11060                                                              preceding
11061                                                              sequentially
11062                                                              consistent global/local
11063                                                              memory instructions
11064                                                              have completed
11065                                                              before executing
11066                                                              this sequentially
11067                                                              consistent
11068                                                              instruction. This
11069                                                              prevents reordering
11070                                                              a seq_cst store
11071                                                              followed by a
11072                                                              seq_cst load. (Note
11073                                                              that seq_cst is
11074                                                              stronger than
11075                                                              acquire/release as
11076                                                              the reordering of
11077                                                              load acquire
11078                                                              followed by a store
11079                                                              release is
11080                                                              prevented by the
11081                                                              s_waitcnt of
11082                                                              the release, but
11083                                                              there is nothing
11084                                                              preventing a store
11085                                                              release followed by
11086                                                              load acquire from
11087                                                              completing out of
11088                                                              order. The s_waitcnt
11089                                                              could be placed after
11090                                                              the seq_cst store or
11091                                                              before the seq_cst load. We
11092                                                              choose the load to
11093                                                              make the s_waitcnt be
11094                                                              as late as possible
11095                                                              so that the store
11096                                                              may have already
11097                                                              completed.)
11098
11099                                                          2. *Following
11100                                                             instructions same as
11101                                                             corresponding load
11102                                                             atomic acquire,
11103                                                             except must generate
11104                                                             all instructions even
11105                                                             for OpenCL.*
11106      load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
11107                                                          local address space cannot
11108                                                          be used.*
11109
11110                                                          *Same as corresponding
11111                                                          load atomic acquire,
11112                                                          except must generate
11113                                                          all instructions even
11114                                                          for OpenCL.*
11115
11116      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
11117                                - system       - generic     vmcnt(0)
11118
11119                                                            - If TgSplit execution mode,
11120                                                              omit lgkmcnt(0).
11121                                                            - Could be split into
11122                                                              separate s_waitcnt
11123                                                              vmcnt(0)
11124                                                              and s_waitcnt
11125                                                              lgkmcnt(0) to allow
11126                                                              them to be
11127                                                              independently moved
11128                                                              according to the
11129                                                              following rules.
11130                                                            - s_waitcnt lgkmcnt(0)
11131                                                              must happen after
11132                                                              preceding
11133                                                              global/generic load
11134                                                              atomic/store
11135                                                              atomic/atomicrmw
11136                                                              with memory
11137                                                              ordering of seq_cst
11138                                                              and with equal or
11139                                                              wider sync scope.
11140                                                              (Note that seq_cst
11141                                                              fences have their
11142                                                              own s_waitcnt
11143                                                              lgkmcnt(0) and so do
11144                                                              not need to be
11145                                                              considered.)
11146                                                            - s_waitcnt vmcnt(0)
11147                                                              must happen after
11148                                                              preceding
11149                                                              global/generic load
11150                                                              atomic/store
11151                                                              atomic/atomicrmw
11152                                                              with memory
11153                                                              ordering of seq_cst
11154                                                              and with equal or
11155                                                              wider sync scope.
11156                                                              (Note that seq_cst
11157                                                              fences have their
11158                                                              own s_waitcnt
11159                                                              vmcnt(0) and so do
11160                                                              not need to be
11161                                                              considered.)
11162                                                            - Ensures any
11163                                                              preceding
11164                                                              sequentially
11165                                                              consistent global
11166                                                              memory instructions
11167                                                              have completed
11168                                                              before executing
11169                                                              this sequentially
11170                                                              consistent
11171                                                              instruction. This
11172                                                              prevents reordering
11173                                                              a seq_cst store
11174                                                              followed by a
11175                                                              seq_cst load. (Note
11176                                                              that seq_cst is
11177                                                              stronger than
11178                                                              acquire/release as
11179                                                              the reordering of
11180                                                              load acquire
11181                                                              followed by a store
11182                                                              release is
11183                                                              prevented by the
11184                                                              s_waitcnt of
11185                                                              the release, but
11186                                                              there is nothing
11187                                                              preventing a store
11188                                                              release followed by
11189                                                              load acquire from
11190                                                              completing out of
11191                                                              order. The s_waitcnt
11192                                                              could be placed after
11193                                                              seq_store or before
11194                                                              the seq_load. We
11195                                                              choose the load to
11196                                                              make the s_waitcnt be
11197                                                              as late as possible
11198                                                              so that the store
11199                                                              may have already
11200                                                              completed.)
11201
11202                                                          2. *Following
11203                                                             instructions same as
11204                                                             corresponding load
11205                                                             atomic acquire,
11206                                                             except must generate
11207                                                             all instructions even
11208                                                             for OpenCL.*
11209      store atomic seq_cst      - singlethread - global   *Same as corresponding
11210                                - wavefront    - local    store atomic release,
11211                                - workgroup    - generic  except must generate
11212                                - agent                   all instructions even
11213                                - system                  for OpenCL.*
11214      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
11215                                - wavefront    - local    atomicrmw acq_rel,
11216                                - workgroup    - generic  except must generate
11217                                - agent                   all instructions even
11218                                - system                  for OpenCL.*
11219      fence        seq_cst      - singlethread *none*     *Same as corresponding
11220                                - wavefront               fence acq_rel,
11221                                - workgroup               except must generate
11222                                - agent                   all instructions even
11223                                - system                  for OpenCL.*
11224      ============ ============ ============== ========== ================================
11225
11226 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11227
11228 Memory Model GFX10-GFX11
11229 ++++++++++++++++++++++++
11230
11231 For GFX10-GFX11:
11232
11233 * Each agent has multiple shader arrays (SA).
11234 * Each SA has multiple work-group processors (WGP).
11235 * Each WGP has multiple compute units (CU).
11236 * Each CU has multiple SIMDs that execute wavefronts.
11237 * The wavefronts for a single work-group are executed in the same
11238   WGP. In CU wavefront execution mode the wavefronts may be executed by
11239   different SIMDs in the same CU. In WGP wavefront execution mode the
11240   wavefronts may be executed by different SIMDs in different CUs in the same
11241   WGP.
11242 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
11243   executing on it.
11244 * All LDS operations of a WGP are performed as wavefront wide operations in a
11245   global order and involve no caching. Completion is reported to a wavefront in
11246   execution order.
11247 * The LDS memory has multiple request queues shared by the SIMDs of a
11248   WGP. Therefore, the LDS operations performed by different wavefronts of a
11249   work-group can be reordered relative to each other, which can result in
11250   reordering the visibility of vector memory operations with respect to LDS
11251   operations of other wavefronts in the same work-group. A ``s_waitcnt
11252   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11253   vector memory operations between wavefronts of a work-group, but not between
11254   operations performed by the same wavefront.
11255 * The vector memory operations are performed as wavefront wide operations.
11256   Completion of load/store/sample operations is reported to a wavefront in
11257   execution order of other load/store/sample operations performed by that
11258   wavefront.
11259 * The vector memory operations access a vector L0 cache. There is a single L0
11260   cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11261   special action is required for coherence between the lanes of a single
11262   wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11263   wavefronts executing in the same work-group as they may be executing on SIMDs
11264   of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11265   required for coherence between wavefronts executing in different work-groups
11266   as they may be executing on different WGPs.
11267 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
11268   on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11269   operations are used in a restricted way so do not impact the memory model. See
11270   :ref:`amdgpu-amdhsa-memory-spaces`.
11271 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11272   the same SA. Therefore, no special action is required for coherence between
11273   the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11274   required for coherence between wavefronts executing in different work-groups
11275   as they may be executing on different SAs that access different L1s.
11276 * The L1 caches have independent quadrants to service disjoint ranges of virtual
11277   addresses.
11278 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11279   vector and scalar memory operations performed by different wavefronts, whether
11280   executing in the same or different work-groups (which may be executing on
11281   different CUs accessing different L0s), can be reordered relative to each
11282   other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11283   synchronization between vector memory operations of different wavefronts. It
11284   ensures a previous vector memory operation has completed before executing a
11285   subsequent vector memory or LDS operation and so can be used to meet the
11286   requirements of acquire, release and sequential consistency.
11287 * The L1 caches use an L2 cache shared by all SAs on the same agent.
11288 * The L2 cache has independent channels to service disjoint ranges of virtual
11289   addresses.
11290 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11291   quadrant has a separate request queue per L2 channel. Therefore, the vector
11292   and scalar memory operations performed by wavefronts executing in different
11293   work-groups (which may be executing on different SAs) of an agent can be
11294   reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11295   required to ensure synchronization between vector memory operations of
11296   different SAs. It ensures a previous vector memory operation has completed
11297   before executing a subsequent vector memory operation and so can be used to meet the
11298   requirements of acquire, release and sequential consistency.
11299 * The L2 cache can be kept coherent with other agents on some targets, or ranges
11300   of virtual addresses can be set up to bypass it to ensure system coherence.
11301 * On GFX10.3 and GFX11, a memory-attached last-level (MALL) cache exists for GPU memory.
11302   The MALL cache is fully coherent with GPU memory and has no impact on system
11303   coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
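
The hierarchy above determines the closest cache level that two wavefronts have
in common, which is why the invalidates are needed. The following Python sketch
is illustrative only and is not part of the AMDGPU backend. The ``Location``
type and the ``first_shared_cache`` function are hypothetical names, and the
sketch models only the vector memory path within a single agent.

.. code-block:: python

   from typing import NamedTuple


   class Location(NamedTuple):
       """Hypothetical position of a wavefront within one agent."""
       sa: int   # shader array index
       wgp: int  # work-group processor index within the SA
       cu: int   # compute unit index within the WGP


   def first_shared_cache(a: Location, b: Location) -> str:
       """Return the closest vector cache level that both wavefronts access."""
       if a.sa != b.sa:
           return "L2"  # the L2 is shared by all SAs on the same agent
       if (a.wgp, a.cu) != (b.wgp, b.cu):
           return "L1"  # the L1 is shared by all WGPs on the same SA
       return "L0"      # there is a single vector L0 per CU


   # Two CUs of the same WGP do not share an L0, only the L1.
   print(first_shared_cache(Location(0, 0, 0), Location(0, 0, 1)))

Wavefronts whose closest shared level is the L1 are the case in which a
``buffer_gl0_inv`` is required, and those whose closest shared level is the L2
additionally require a ``buffer_gl1_inv``.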
11304
11305 Scalar memory operations are only used to access memory that is proven to not
11306 change during the execution of the kernel dispatch. This includes constant
11307 address space and global address space for program scope ``const`` variables.
11308 Therefore, the kernel machine code does not have to maintain the scalar cache to
11309 ensure it is coherent with the vector caches. The scalar and vector caches are
11310 invalidated between kernel dispatches by CP since constant address space data
11311 may change between kernel dispatch executions. See
11312 :ref:`amdgpu-amdhsa-memory-spaces`.
11313
11314 The one exception is if scalar writes are used to spill SGPR registers. In this
11315 case the AMDGPU backend ensures the memory location used to spill is never
11316 accessed by vector memory operations at the same time. If scalar writes are used
11317 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11318 return since the locations may be used for vector memory instructions by a
11319 future wavefront that uses the same scratch area, or a function call that
11320 creates a frame at the same address, respectively. There is no need for a
11321 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
11322
11323 For kernarg backing memory:
11324
11325 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11326 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11327   needing to invalidate the L2 cache.
11328 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11329   so the L2 cache will be coherent with the CPU and other agents.
11330
11331 Scratch backing memory (which is used for the private address space) is accessed
11332 with MTYPE NC (non-coherent). Since the private address space is only accessed
11333 by a single thread, and is always write-before-read, there is never a need to
11334 invalidate these entries from the L0 or L1 caches.
11335
11336 Wavefronts are executed in native mode with in-order reporting of loads and
11337 sample instructions. In this mode vmcnt reports completion of load, atomic with
11338 return and sample instructions in order, and the vscnt reports the completion of
11339 store and atomic without return in order. See ``MEM_ORDERED`` field in
11340 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
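
The split between the two counters can be sketched as a lookup. This is a
minimal Python illustration, not the backend's implementation. The operation
kind names and the ``waitcnt_for`` helper are hypothetical, and the returned
string uses this document's ``s_waitcnt`` notation rather than exact assembler
syntax.

.. code-block:: python

   # Counter that reports completion for each kind of vector memory operation.
   # The keys are hypothetical labels chosen for this sketch.
   COUNTER = {
       "load": "vmcnt",
       "sample": "vmcnt",
       "atomic_return": "vmcnt",
       "store": "vscnt",
       "atomic_noreturn": "vscnt",
   }


   def waitcnt_for(outstanding):
       """Build the wait needed for all outstanding operation kinds to finish."""
       counters = sorted({COUNTER[op] for op in outstanding})
       if not counters:
           return None  # nothing outstanding, so no wait is needed
       return "s_waitcnt " + " & ".join(f"{c}(0)" for c in counters)


   print(waitcnt_for(["load", "store"]))

A wait that must cover both loads and stores therefore names both counters,
while a wait after only stores needs ``vscnt(0)`` alone.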
11341
11342 Wavefronts can be executed in WGP or CU wavefront execution mode:
11343
11344 * In WGP wavefront execution mode the wavefronts of a work-group are executed
11345   on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
11346   CU L0 caches is required for work-group synchronization. Also, accesses to L1
11347   at work-group scope need to be explicitly ordered as the accesses from
11348   different CUs are not ordered.
11349 * In CU wavefront execution mode the wavefronts of a work-group are executed on
11350   the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
11351   the work-group access the same L0, which in turn ensures L1 accesses are
11352   ordered, and so explicit management of the caches is not required for
11353   work-group synchronization.
11354
11355 See ``WGP_MODE`` field in
11356 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
11357 :ref:`amdgpu-target-features`.
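
As an illustration of how the execution mode changes the generated code, the
following Python sketch returns the sequence that the GFX10-GFX11 code
sequences table lists for a ``load atomic acquire`` at ``workgroup`` scope in
the global address space. The function name is hypothetical, and the sketch
only restates that single table row. It is not how the backend selects
instructions.

.. code-block:: python

   def load_acquire_workgroup_global(cu_mode: bool) -> list:
       """Sequence for a workgroup-scope acquire load from global memory."""
       if cu_mode:
           # All wavefronts of the work-group use the same L0, so glc=1, the
           # wait and the invalidate are all omitted.
           return ["buffer/global_load"]
       # WGP mode: the wavefronts may be on different CUs with different L0s.
       return [
           "buffer/global_load glc=1",
           "s_waitcnt vmcnt(0)",  # must happen before the invalidate
           "buffer_gl0_inv",      # following loads will not see stale data
       ]


   for mode in (True, False):
       print(mode, load_acquire_workgroup_global(cu_mode=mode))

In CU wavefront execution mode the sequence collapses to the plain load, which
is the practical effect of not needing to manage the per CU L0 caches.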
11358
11359 The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11360 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
11361
11362   .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11363      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11364
11365      ============ ============ ============== ========== ================================
11366      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
11367                   Ordering     Sync Scope     Address    GFX10-GFX11
11368                                               Space
11369      ============ ============ ============== ========== ================================
11370      **Non-Atomic**
11371      ------------------------------------------------------------------------------------
11372      load         *none*       *none*         - global   - !volatile & !nontemporal
11373                                               - generic
11374                                               - private    1. buffer/global/flat_load
11375                                               - constant
11376                                                          - !volatile & nontemporal
11377
11378                                                            1. buffer/global/flat_load
11379                                                               slc=1 dlc=1
11380
11381                                                             - If GFX10, omit dlc=1.
11382
11383                                                          - volatile
11384
11385                                                            1. buffer/global/flat_load
11386                                                               glc=1 dlc=1
11387
11388                                                            2. s_waitcnt vmcnt(0)
11389
11390                                                             - Must happen before
11391                                                               any following volatile
11392                                                               global/generic
11393                                                               load/store.
11394                                                             - Ensures that
11395                                                               volatile
11396                                                               operations to
11397                                                               different
11398                                                               addresses will not
11399                                                               be reordered by
11400                                                               hardware.
11401
11402      load         *none*       *none*         - local    1. ds_load
11403      store        *none*       *none*         - global   - !volatile & !nontemporal
11404                                               - generic
11405                                               - private    1. buffer/global/flat_store
11406                                               - constant
11407                                                          - !volatile & nontemporal
11408
11409                                                            1. buffer/global/flat_store
11410                                                               glc=1 slc=1 dlc=1
11411
11412                                                             - If GFX10, omit dlc=1.
11413
11414                                                          - volatile
11415
11416                                                            1. buffer/global/flat_store
11417                                                               dlc=1
11418
11419                                                             - If GFX10, omit dlc=1.
11420
11421                                                            2. s_waitcnt vscnt(0)
11422
11423                                                             - Must happen before
11424                                                               any following volatile
11425                                                               global/generic
11426                                                               load/store.
11427                                                             - Ensures that
11428                                                               volatile
11429                                                               operations to
11430                                                               different
11431                                                               addresses will not
11432                                                               be reordered by
11433                                                               hardware.
11434
11435      store        *none*       *none*         - local    1. ds_store
11436      **Unordered Atomic**
11437      ------------------------------------------------------------------------------------
11438      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
11439      store atomic unordered    *any*          *any*      *Same as non-atomic*.
11440      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
11441      **Monotonic Atomic**
11442      ------------------------------------------------------------------------------------
11443      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
11444                                - wavefront    - generic
11445      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
11446                                               - generic     glc=1
11447
11448                                                            - If CU wavefront execution
11449                                                              mode, omit glc=1.
11450
11451      load atomic  monotonic    - singlethread - local    1. ds_load
11452                                - wavefront
11453                                - workgroup
11454      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
11455                                - system       - generic     glc=1 dlc=1
11456
11457                                                            - If GFX11, omit dlc=1.
11458
11459      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
11460                                - wavefront    - generic
11461                                - workgroup
11462                                - agent
11463                                - system
11464      store atomic monotonic    - singlethread - local    1. ds_store
11465                                - wavefront
11466                                - workgroup
11467      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
11468                                - wavefront    - generic
11469                                - workgroup
11470                                - agent
11471                                - system
11472      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
11473                                - wavefront
11474                                - workgroup
11475      **Acquire Atomic**
11476      ------------------------------------------------------------------------------------
11477      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
11478                                - wavefront    - local
11479                                               - generic
11480      load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
11481
11482                                                            - If CU wavefront execution
11483                                                              mode, omit glc=1.
11484
11485                                                          2. s_waitcnt vmcnt(0)
11486
11487                                                            - If CU wavefront execution
11488                                                              mode, omit.
11489                                                            - Must happen before
11490                                                              the following buffer_gl0_inv
11491                                                              and before any following
11492                                                              global/generic
11493                                                              load/load
11494                                                              atomic/store/store
11495                                                              atomic/atomicrmw.
11496
11497                                                          3. buffer_gl0_inv
11498
11499                                                            - If CU wavefront execution
11500                                                              mode, omit.
11501                                                            - Ensures that
11502                                                              following
11503                                                              loads will not see
11504                                                              stale data.
11505
11506      load atomic  acquire      - workgroup    - local    1. ds_load
11507                                                          2. s_waitcnt lgkmcnt(0)
11508
11509                                                            - If OpenCL, omit.
11510                                                            - Must happen before
11511                                                              the following buffer_gl0_inv
11512                                                              and before any following
11513                                                              global/generic load/load
11514                                                              atomic/store/store
11515                                                              atomic/atomicrmw.
11516                                                            - Ensures any
11517                                                              following global
11518                                                              data read is no
11519                                                              older than the local load
11520                                                              atomic value being
11521                                                              acquired.
11522
11523                                                          3. buffer_gl0_inv
11524
11525                                                            - If CU wavefront execution
11526                                                              mode, omit.
11527                                                            - If OpenCL, omit.
11528                                                            - Ensures that
11529                                                              following
11530                                                              loads will not see
11531                                                              stale data.
11532
11533      load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
11534
11535                                                            - If CU wavefront execution
11536                                                              mode, omit glc=1.
11537
11538                                                          2. s_waitcnt lgkmcnt(0) &
11539                                                             vmcnt(0)
11540
11541                                                            - If CU wavefront execution
11542                                                              mode, omit vmcnt(0).
11543                                                            - If OpenCL, omit
11544                                                              lgkmcnt(0).
11545                                                            - Must happen before
11546                                                              the following
11547                                                              buffer_gl0_inv and any
11548                                                              following global/generic
11549                                                              load/load
11550                                                              atomic/store/store
11551                                                              atomic/atomicrmw.
11552                                                            - Ensures any
11553                                                              following global
11554                                                              data read is no
11555                                                              older than a local load
11556                                                              atomic value being
11557                                                              acquired.
11558
11559                                                          3. buffer_gl0_inv
11560
11561                                                            - If CU wavefront execution
11562                                                              mode, omit.
11563                                                            - Ensures that
11564                                                              following
11565                                                              loads will not see
11566                                                              stale data.
11567
11568      load atomic  acquire      - agent        - global   1. buffer/global_load
11569                                - system                     glc=1 dlc=1
11570
11571                                                            - If GFX11, omit dlc=1.
11572
11573                                                          2. s_waitcnt vmcnt(0)
11574
11575                                                            - Must happen before
11576                                                              following
11577                                                              buffer_gl*_inv.
11578                                                            - Ensures the load
11579                                                              has completed
11580                                                              before invalidating
11581                                                              the caches.
11582
11583                                                          3. buffer_gl0_inv;
11584                                                             buffer_gl1_inv
11585
11586                                                            - Must happen before
11587                                                              any following
11588                                                              global/generic
11589                                                              load/load
11590                                                              atomic/atomicrmw.
11591                                                            - Ensures that
11592                                                              following
11593                                                              loads will not see
11594                                                              stale global data.
11595
11596      load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
11597                                - system
11598                                                            - If GFX11, omit dlc=1.
11599
11600                                                          2. s_waitcnt vmcnt(0) &
11601                                                             lgkmcnt(0)
11602
11603                                                            - If OpenCL, omit
11604                                                              lgkmcnt(0).
11605                                                            - Must happen before
11606                                                              following
11607                                                              buffer_gl*_inv.
11608                                                            - Ensures the flat_load
11609                                                              has completed
11610                                                              before invalidating
11611                                                              the caches.
11612
11613                                                          3. buffer_gl0_inv;
11614                                                             buffer_gl1_inv
11615
11616                                                            - Must happen before
11617                                                              any following
11618                                                              global/generic
11619                                                              load/load
11620                                                              atomic/atomicrmw.
11621                                                            - Ensures that
11622                                                              following loads
11623                                                              will not see stale
11624                                                              global data.
11625
11626      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
11627                                - wavefront    - local
11628                                               - generic
11629      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
11630                                                          2. s_waitcnt vm/vscnt(0)
11631
11632                                                            - If CU wavefront execution
11633                                                              mode, omit.
11634                                                            - Use vmcnt(0) if atomic with
11635                                                              return and vscnt(0) if
11636                                                              atomic with no-return.
11637                                                            - Must happen before
11638                                                              the following buffer_gl0_inv
11639                                                              and before any following
11640                                                              global/generic
11641                                                              load/load
11642                                                              atomic/store/store
11643                                                              atomic/atomicrmw.
11644
11645                                                          3. buffer_gl0_inv
11646
11647                                                            - If CU wavefront execution
11648                                                              mode, omit.
11649                                                            - Ensures that
11650                                                              following
11651                                                              loads will not see
11652                                                              stale data.
11653
11654      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
11655                                                          2. s_waitcnt lgkmcnt(0)
11656
11657                                                            - If OpenCL, omit.
11658                                                            - Must happen before
11659                                                              the following
11660                                                              buffer_gl0_inv.
11661                                                            - Ensures any
11662                                                              following global
11663                                                              data read is no
11664                                                              older than the local
11665                                                              atomicrmw value
11666                                                              being acquired.
11667
11668                                                          3. buffer_gl0_inv
11669
11670                                                            - If OpenCL, omit.
11671                                                            - Ensures that
11672                                                              following
11673                                                              loads will not see
11674                                                              stale data.
11675
11676      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
11677                                                          2. s_waitcnt lgkmcnt(0) &
11678                                                             vm/vscnt(0)
11679
11680                                                            - If CU wavefront execution
11681                                                              mode, omit vm/vscnt(0).
11682                                                            - If OpenCL, omit lgkmcnt(0).
11683                                                            - Use vmcnt(0) if atomic with
11684                                                              return and vscnt(0) if
11685                                                              atomic with no-return.
11686                                                            - Must happen before
11687                                                              the following
11688                                                              buffer_gl0_inv.
11689                                                            - Ensures any
11690                                                              following global
11691                                                              data read is no
11692                                                              older than a local
11693                                                              atomicrmw value
11694                                                              being acquired.
11695
11696                                                          3. buffer_gl0_inv
11697
11698                                                            - If CU wavefront execution
11699                                                              mode, omit.
11700                                                            - Ensures that
11701                                                              following
11702                                                              loads will not see
11703                                                              stale data.
11704
11705      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
11706                                - system                  2. s_waitcnt vm/vscnt(0)
11707
11708                                                            - Use vmcnt(0) if atomic with
11709                                                              return and vscnt(0) if
11710                                                              atomic with no-return.
11711                                                            - Must happen before
11712                                                              following
11713                                                              buffer_gl*_inv.
11714                                                            - Ensures the
11715                                                              atomicrmw has
11716                                                              completed before
11717                                                              invalidating the
11718                                                              caches.
11719
11720                                                          3. buffer_gl0_inv;
11721                                                             buffer_gl1_inv
11722
11723                                                            - Must happen before
11724                                                              any following
11725                                                              global/generic
11726                                                              load/load
11727                                                              atomic/atomicrmw.
11728                                                            - Ensures that
11729                                                              following loads
11730                                                              will not see stale
11731                                                              global data.
11732
11733      atomicrmw    acquire      - agent        - generic  1. flat_atomic
11734                                - system                  2. s_waitcnt vm/vscnt(0) &
11735                                                             lgkmcnt(0)
11736
11737                                                            - If OpenCL, omit
11738                                                              lgkmcnt(0).
11739                                                            - Use vmcnt(0) if atomic with
11740                                                              return and vscnt(0) if
11741                                                              atomic with no-return.
11742                                                            - Must happen before
11743                                                              following
11744                                                              buffer_gl*_inv.
11745                                                            - Ensures the
11746                                                              atomicrmw has
11747                                                              completed before
11748                                                              invalidating the
11749                                                              caches.
11750
11751                                                          3. buffer_gl0_inv;
11752                                                             buffer_gl1_inv
11753
11754                                                            - Must happen before
11755                                                              any following
11756                                                              global/generic
11757                                                              load/load
11758                                                              atomic/atomicrmw.
11759                                                            - Ensures that
11760                                                              following loads
11761                                                              will not see stale
11762                                                              global data.
11763
11764      fence        acquire      - singlethread *none*     *none*
11765                                - wavefront
11766      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
11767                                                             vmcnt(0) & vscnt(0)
11768
11769                                                            - If CU wavefront execution
11770                                                              mode, omit vmcnt(0) and
11771                                                              vscnt(0).
11772                                                            - If OpenCL and
11773                                                              address space is
11774                                                              not generic, omit
11775                                                              lgkmcnt(0).
11776                                                            - If OpenCL and
11777                                                              address space is
11778                                                              local, omit
11779                                                              vmcnt(0) and vscnt(0).
11780                                                            - However, since LLVM
11781                                                              currently has no
11782                                                              address space on
11783                                                              the fence, the waits
11784                                                              must conservatively
11785                                                              always be generated.
11786                                                              If the fence had an
11787                                                              address space, it
11788                                                              would be set to the
11789                                                              address space of the
11790                                                              OpenCL fence flag, or
11791                                                              to generic if both
11792                                                              local and global
11793                                                              flags are
11794                                                              specified.
11795                                                            - Could be split into
11796                                                              separate s_waitcnt
11797                                                              vmcnt(0), s_waitcnt
11798                                                              vscnt(0) and s_waitcnt
11799                                                              lgkmcnt(0) to allow
11800                                                              them to be
11801                                                              independently moved
11802                                                              according to the
11803                                                              following rules.
11804                                                            - s_waitcnt vmcnt(0)
11805                                                              must happen after
11806                                                              any preceding
11807                                                              global/generic load
11808                                                              atomic/
11809                                                              atomicrmw-with-return-value
11810                                                              with an equal or
11811                                                              wider sync scope
11812                                                              and memory ordering
11813                                                              stronger than
11814                                                              unordered (this is
11815                                                              termed the
11816                                                              fence-paired-atomic).
11817                                                            - s_waitcnt vscnt(0)
11818                                                              must happen after
11819                                                              any preceding
11820                                                              global/generic
11821                                                              atomicrmw-no-return-value
11822                                                              with an equal or
11823                                                              wider sync scope
11824                                                              and memory ordering
11825                                                              stronger than
11826                                                              unordered (this is
11827                                                              termed the
11828                                                              fence-paired-atomic).
11829                                                            - s_waitcnt lgkmcnt(0)
11830                                                              must happen after
11831                                                              any preceding
11832                                                              local/generic load
11833                                                              atomic/atomicrmw
11834                                                              with an equal or
11835                                                              wider sync scope
11836                                                              and memory ordering
11837                                                              stronger than
11838                                                              unordered (this is
11839                                                              termed the
11840                                                              fence-paired-atomic).
11841                                                            - Must happen before
11842                                                              the following
11843                                                              buffer_gl0_inv.
11844                                                            - Ensures that the
11845                                                              fence-paired-atomic
11846                                                              has completed
11847                                                              before invalidating
11848                                                              the
11849                                                              cache. Therefore
11850                                                              any following
11851                                                              locations read must
11852                                                              be no older than
11853                                                              the value read by
11854                                                              the
11855                                                              fence-paired-atomic.
11856
11857                                                          2. buffer_gl0_inv
11858
11859                                                            - If CU wavefront execution
11860                                                              mode, omit.
11861                                                            - Ensures that
11862                                                              following
11863                                                              loads will not see
11864                                                              stale data.
11865
11866      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
11867                                - system                     vmcnt(0) & vscnt(0)
11868
11869                                                            - If OpenCL and
11870                                                              address space is
11871                                                              not generic, omit
11872                                                              lgkmcnt(0).
11873                                                            - If OpenCL and
11874                                                              address space is
11875                                                              local, omit
11876                                                              vmcnt(0) and vscnt(0).
11877                                                            - However, since LLVM
11878                                                              currently has no
11879                                                              address space on
11880                                                              the fence, the waits
11881                                                              must conservatively
11882                                                              always be generated
11883                                                              (see comment for
11884                                                              previous fence).
11885                                                            - Could be split into
11886                                                              separate s_waitcnt
11887                                                              vmcnt(0), s_waitcnt
11888                                                              vscnt(0) and s_waitcnt
11889                                                              lgkmcnt(0) to allow
11890                                                              them to be
11891                                                              independently moved
11892                                                              according to the
11893                                                              following rules.
11894                                                            - s_waitcnt vmcnt(0)
11895                                                              must happen after
11896                                                              any preceding
11897                                                              global/generic load
11898                                                              atomic/
11899                                                              atomicrmw-with-return-value
11900                                                              with an equal or
11901                                                              wider sync scope
11902                                                              and memory ordering
11903                                                              stronger than
11904                                                              unordered (this is
11905                                                              termed the
11906                                                              fence-paired-atomic).
11907                                                            - s_waitcnt vscnt(0)
11908                                                              must happen after
11909                                                              any preceding
11910                                                              global/generic
11911                                                              atomicrmw-no-return-value
11912                                                              with an equal or
11913                                                              wider sync scope
11914                                                              and memory ordering
11915                                                              stronger than
11916                                                              unordered (this is
11917                                                              termed the
11918                                                              fence-paired-atomic).
11919                                                            - s_waitcnt lgkmcnt(0)
11920                                                              must happen after
11921                                                              any preceding
11922                                                              local/generic load
11923                                                              atomic/atomicrmw
11924                                                              with an equal or
11925                                                              wider sync scope
11926                                                              and memory ordering
11927                                                              stronger than
11928                                                              unordered (this is
11929                                                              termed the
11930                                                              fence-paired-atomic).
11931                                                            - Must happen before
11932                                                              the following
11933                                                              buffer_gl*_inv.
11934                                                            - Ensures that the
11935                                                              fence-paired-atomic
11936                                                              has completed
11937                                                              before invalidating
11938                                                              the
11939                                                              caches. Therefore
11940                                                              any following
11941                                                              locations read must
11942                                                              be no older than
11943                                                              the value read by
11944                                                              the
11945                                                              fence-paired-atomic.
11946
11947                                                          2. buffer_gl0_inv;
11948                                                             buffer_gl1_inv
11949
11950                                                            - Must happen before any
11951                                                              following global/generic
11952                                                              load/load
11953                                                              atomic/store/store
11954                                                              atomic/atomicrmw.
11955                                                            - Ensures that
11956                                                              following loads
11957                                                              will not see stale
11958                                                              global data.
11959
11960      **Release Atomic**
11961      ------------------------------------------------------------------------------------
11962      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
11963                                - wavefront    - local
11964                                               - generic
11965      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
11966                                               - generic     vmcnt(0) & vscnt(0)
11967
11968                                                            - If CU wavefront execution
11969                                                              mode, omit vmcnt(0) and
11970                                                              vscnt(0).
11971                                                            - If OpenCL, omit
11972                                                              lgkmcnt(0).
11973                                                            - Could be split into
11974                                                              separate s_waitcnt
11975                                                              vmcnt(0), s_waitcnt
11976                                                              vscnt(0) and s_waitcnt
11977                                                              lgkmcnt(0) to allow
11978                                                              them to be
11979                                                              independently moved
11980                                                              according to the
11981                                                              following rules.
11982                                                            - s_waitcnt vmcnt(0)
11983                                                              must happen after
11984                                                              any preceding
11985                                                              global/generic load/load
11986                                                              atomic/
11987                                                              atomicrmw-with-return-value.
11988                                                            - s_waitcnt vscnt(0)
11989                                                              must happen after
11990                                                              any preceding
11991                                                              global/generic
11992                                                              store/store
11993                                                              atomic/
11994                                                              atomicrmw-no-return-value.
11995                                                            - s_waitcnt lgkmcnt(0)
11996                                                              must happen after
11997                                                              any preceding
11998                                                              local/generic
11999                                                              load/store/load
12000                                                              atomic/store
12001                                                              atomic/atomicrmw.
12002                                                            - Must happen before
12003                                                              the following
12004                                                              store.
12005                                                            - Ensures that all
12006                                                              memory operations
12007                                                              have
12008                                                              completed before
12009                                                              performing the
12010                                                              store that is being
12011                                                              released.
12012
12013                                                          2. buffer/global/flat_store
12014      store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12015
12016                                                            - If CU wavefront execution
12017                                                              mode, omit.
12018                                                            - If OpenCL, omit.
12019                                                            - Could be split into
12020                                                              separate s_waitcnt
12021                                                              vmcnt(0) and s_waitcnt
12022                                                              vscnt(0) to allow
12023                                                              them to be
12024                                                              independently moved
12025                                                              according to the
12026                                                              following rules.
12027                                                            - s_waitcnt vmcnt(0)
12028                                                              must happen after
12029                                                              any preceding
12030                                                              global/generic load/load
12031                                                              atomic/
12032                                                              atomicrmw-with-return-value.
12033                                                            - s_waitcnt vscnt(0)
12034                                                              must happen after
12035                                                              any preceding
12036                                                              global/generic
12037                                                              store/store atomic/
12038                                                              atomicrmw-no-return-value.
12039                                                            - Must happen before
12040                                                              the following
12041                                                              store.
12042                                                            - Ensures that all
12043                                                              global memory
12044                                                              operations have
12045                                                              completed before
12046                                                              performing the
12047                                                              store that is being
12048                                                              released.
12049
12050                                                          2. ds_store
12051      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12052                                - system       - generic     vmcnt(0) & vscnt(0)
12053
12054                                                            - If OpenCL and
12055                                                              address space is
12056                                                              not generic, omit
12057                                                              lgkmcnt(0).
12058                                                            - Could be split into
12059                                                              separate s_waitcnt
12060                                                              vmcnt(0), s_waitcnt vscnt(0)
12061                                                              and s_waitcnt
12062                                                              lgkmcnt(0) to allow
12063                                                              them to be
12064                                                              independently moved
12065                                                              according to the
12066                                                              following rules.
12067                                                            - s_waitcnt vmcnt(0)
12068                                                              must happen after
12069                                                              any preceding
12070                                                              global/generic
12071                                                              load/load
12072                                                              atomic/
12073                                                              atomicrmw-with-return-value.
12074                                                            - s_waitcnt vscnt(0)
12075                                                              must happen after
12076                                                              any preceding
12077                                                              global/generic
12078                                                              store/store atomic/
12079                                                              atomicrmw-no-return-value.
12080                                                            - s_waitcnt lgkmcnt(0)
12081                                                              must happen after
12082                                                              any preceding
12083                                                              local/generic
12084                                                              load/store/load
12085                                                              atomic/store
12086                                                              atomic/atomicrmw.
12087                                                            - Must happen before
12088                                                              the following
12089                                                              store.
12090                                                            - Ensures that all
12091                                                              memory operations
12092                                                              have
12093                                                              completed before
12094                                                              performing the
12095                                                              store that is being
12096                                                              released.
12097
12098                                                          2. buffer/global/flat_store
12099      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
12100                                - wavefront    - local
12101                                               - generic
12102      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12103                                               - generic     vmcnt(0) & vscnt(0)
12104
12105                                                            - If CU wavefront execution
12106                                                              mode, omit vmcnt(0) and
12107                                                              vscnt(0).
12108                                                            - If OpenCL, omit lgkmcnt(0).
12109                                                            - Could be split into
12110                                                              separate s_waitcnt
12111                                                              vmcnt(0), s_waitcnt
12112                                                              vscnt(0) and s_waitcnt
12113                                                              lgkmcnt(0) to allow
12114                                                              them to be
12115                                                              independently moved
12116                                                              according to the
12117                                                              following rules.
12118                                                            - s_waitcnt vmcnt(0)
12119                                                              must happen after
12120                                                              any preceding
12121                                                              global/generic load/load
12122                                                              atomic/
12123                                                              atomicrmw-with-return-value.
12124                                                            - s_waitcnt vscnt(0)
12125                                                              must happen after
12126                                                              any preceding
12127                                                              global/generic
12128                                                              store/store
12129                                                              atomic/
12130                                                              atomicrmw-no-return-value.
12131                                                            - s_waitcnt lgkmcnt(0)
12132                                                              must happen after
12133                                                              any preceding
12134                                                              local/generic
12135                                                              load/store/load
12136                                                              atomic/store
12137                                                              atomic/atomicrmw.
12138                                                            - Must happen before
12139                                                              the following
12140                                                              atomicrmw.
12141                                                            - Ensures that all
12142                                                              memory operations
12143                                                              have
12144                                                              completed before
12145                                                              performing the
12146                                                              atomicrmw that is
12147                                                              being released.
12148
12149                                                          2. buffer/global/flat_atomic
12150      atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12151
12152                                                            - If CU wavefront execution
12153                                                              mode, omit.
12154                                                            - If OpenCL, omit.
12155                                                            - Could be split into
12156                                                              separate s_waitcnt
12157                                                              vmcnt(0) and s_waitcnt
12158                                                              vscnt(0) to allow
12159                                                              them to be
12160                                                              independently moved
12161                                                              according to the
12162                                                              following rules.
12163                                                            - s_waitcnt vmcnt(0)
12164                                                              must happen after
12165                                                              any preceding
12166                                                              global/generic load/load
12167                                                              atomic/
12168                                                              atomicrmw-with-return-value.
12169                                                            - s_waitcnt vscnt(0)
12170                                                              must happen after
12171                                                              any preceding
12172                                                              global/generic
12173                                                              store/store atomic/
12174                                                              atomicrmw-no-return-value.
12175                                                            - Must happen before
12176                                                              the following
12177                                                              atomicrmw.
12178                                                            - Ensures that all
12179                                                              global memory
12180                                                              operations have
12181                                                              completed before
12182                                                              performing the
12183                                                              atomicrmw that is
12184                                                              being released.
12185
12186                                                          2. ds_atomic
12187      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12188                                - system       - generic     vmcnt(0) & vscnt(0)
12189
12190                                                            - If OpenCL, omit
12191                                                              lgkmcnt(0).
12192                                                            - Could be split into
12193                                                              separate s_waitcnt
12194                                                              vmcnt(0), s_waitcnt
12195                                                              vscnt(0) and s_waitcnt
12196                                                              lgkmcnt(0) to allow
12197                                                              them to be
12198                                                              independently moved
12199                                                              according to the
12200                                                              following rules.
12201                                                            - s_waitcnt vmcnt(0)
12202                                                              must happen after
12203                                                              any preceding
12204                                                              global/generic
12205                                                              load/load atomic/
12206                                                              atomicrmw-with-return-value.
12207                                                            - s_waitcnt vscnt(0)
12208                                                              must happen after
12209                                                              any preceding
12210                                                              global/generic
12211                                                              store/store atomic/
12212                                                              atomicrmw-no-return-value.
12213                                                            - s_waitcnt lgkmcnt(0)
12214                                                              must happen after
12215                                                              any preceding
12216                                                              local/generic
12217                                                              load/store/load
12218                                                              atomic/store
12219                                                              atomic/atomicrmw.
12220                                                            - Must happen before
12221                                                              the following
12222                                                              atomicrmw.
12223                                                            - Ensures that all
12224                                                              memory operations
12225                                                              to global and local
12226                                                              have completed
12227                                                              before performing
12228                                                              the atomicrmw that
12229                                                              is being released.
12230
12231                                                          2. buffer/global/flat_atomic
12232      fence        release      - singlethread *none*     *none*
12233                                - wavefront
12234      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12235                                                             vmcnt(0) & vscnt(0)
12236
12237                                                            - If CU wavefront execution
12238                                                              mode, omit vmcnt(0) and
12239                                                              vscnt(0).
12240                                                            - If OpenCL and
12241                                                              address space is
12242                                                              not generic, omit
12243                                                              lgkmcnt(0).
12244                                                            - If OpenCL and
12245                                                              address space is
12246                                                              local, omit
12247                                                              vmcnt(0) and vscnt(0).
12248                                                            - However, since LLVM
12249                                                              currently has no
12250                                                              address space on
12251                                                              the fence, these must
12252                                                              conservatively always
12253                                                              be generated. If
12254                                                              fence had an
12255                                                              address space then
12256                                                              set to address
12257                                                              space of OpenCL
12258                                                              fence flag, or to
12259                                                              generic if both
12260                                                              local and global
12261                                                              flags are
12262                                                              specified.
12263                                                            - Could be split into
12264                                                              separate s_waitcnt
12265                                                              vmcnt(0), s_waitcnt
12266                                                              vscnt(0) and s_waitcnt
12267                                                              lgkmcnt(0) to allow
12268                                                              them to be
12269                                                              independently moved
12270                                                              according to the
12271                                                              following rules.
12272                                                            - s_waitcnt vmcnt(0)
12273                                                              must happen after
12274                                                              any preceding
12275                                                              global/generic
12276                                                              load/load
12277                                                              atomic/
12278                                                              atomicrmw-with-return-value.
12279                                                            - s_waitcnt vscnt(0)
12280                                                              must happen after
12281                                                              any preceding
12282                                                              global/generic
12283                                                              store/store atomic/
12284                                                              atomicrmw-no-return-value.
12285                                                            - s_waitcnt lgkmcnt(0)
12286                                                              must happen after
12287                                                              any preceding
12288                                                              local/generic
12289                                                              load/store/load
12290                                                              atomic/store atomic/
12291                                                              atomicrmw.
12292                                                            - Must happen before
12293                                                              any following store
12294                                                              atomic/atomicrmw
12295                                                              with an equal or
12296                                                              wider sync scope
12297                                                              and memory ordering
12298                                                              stronger than
12299                                                              unordered (this is
12300                                                              termed the
12301                                                              fence-paired-atomic).
12302                                                            - Ensures that all
12303                                                              memory operations
12304                                                              have
12305                                                              completed before
12306                                                              performing the
12307                                                              following
12308                                                              fence-paired-atomic.
12309
12310      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12311                                - system                     vmcnt(0) & vscnt(0)
12312
12313                                                            - If OpenCL and
12314                                                              address space is
12315                                                              not generic, omit
12316                                                              lgkmcnt(0).
12317                                                            - If OpenCL and
12318                                                              address space is
12319                                                              local, omit
12320                                                              vmcnt(0) and vscnt(0).
12321                                                            - However, since LLVM
12322                                                              currently has no
12323                                                              address space on
12324                                                              the fence, these must
12325                                                              conservatively always
12326                                                              be generated. If
12327                                                              fence had an
12328                                                              address space then
12329                                                              set to address
12330                                                              space of OpenCL
12331                                                              fence flag, or to
12332                                                              generic if both
12333                                                              local and global
12334                                                              flags are
12335                                                              specified.
12336                                                            - Could be split into
12337                                                              separate s_waitcnt
12338                                                              vmcnt(0), s_waitcnt
12339                                                              vscnt(0) and s_waitcnt
12340                                                              lgkmcnt(0) to allow
12341                                                              them to be
12342                                                              independently moved
12343                                                              according to the
12344                                                              following rules.
12345                                                            - s_waitcnt vmcnt(0)
12346                                                              must happen after
12347                                                              any preceding
12348                                                              global/generic
12349                                                              load/load atomic/
12350                                                              atomicrmw-with-return-value.
12351                                                            - s_waitcnt vscnt(0)
12352                                                              must happen after
12353                                                              any preceding
12354                                                              global/generic
12355                                                              store/store atomic/
12356                                                              atomicrmw-no-return-value.
12357                                                            - s_waitcnt lgkmcnt(0)
12358                                                              must happen after
12359                                                              any preceding
12360                                                              local/generic
12361                                                              load/store/load
12362                                                              atomic/store
12363                                                              atomic/atomicrmw.
12364                                                            - Must happen before
12365                                                              any following store
12366                                                              atomic/atomicrmw
12367                                                              with an equal or
12368                                                              wider sync scope
12369                                                              and memory ordering
12370                                                              stronger than
12371                                                              unordered (this is
12372                                                              termed the
12373                                                              fence-paired-atomic).
12374                                                            - Ensures that all
12375                                                              memory operations
12376                                                              have
12377                                                              completed before
12378                                                              performing the
12379                                                              following
12380                                                              fence-paired-atomic.
12381
12382      **Acquire-Release Atomic**
12383      ------------------------------------------------------------------------------------
12384      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
12385                                - wavefront    - local
12386                                               - generic
12387      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12388                                                             vmcnt(0) & vscnt(0)
12389
12390                                                            - If CU wavefront execution
12391                                                              mode, omit vmcnt(0) and
12392                                                              vscnt(0).
12393                                                            - If OpenCL, omit
12394                                                              lgkmcnt(0).
12395                                                            - Must happen after
12396                                                              any preceding
12397                                                              local/generic
12398                                                              load/store/load
12399                                                              atomic/store
12400                                                              atomic/atomicrmw.
12401                                                            - Could be split into
12402                                                              separate s_waitcnt
12403                                                              vmcnt(0), s_waitcnt
12404                                                              vscnt(0), and s_waitcnt
12405                                                              lgkmcnt(0) to allow
12406                                                              them to be
12407                                                              independently moved
12408                                                              according to the
12409                                                              following rules.
12410                                                            - s_waitcnt vmcnt(0)
12411                                                              must happen after
12412                                                              any preceding
12413                                                              global/generic load/load
12414                                                              atomic/
12415                                                              atomicrmw-with-return-value.
12416                                                            - s_waitcnt vscnt(0)
12417                                                              must happen after
12418                                                              any preceding
12419                                                              global/generic
12420                                                              store/store
12421                                                              atomic/
12422                                                              atomicrmw-no-return-value.
12423                                                            - s_waitcnt lgkmcnt(0)
12424                                                              must happen after
12425                                                              any preceding
12426                                                              local/generic
12427                                                              load/store/load
12428                                                              atomic/store
12429                                                              atomic/atomicrmw.
12430                                                            - Must happen before
12431                                                              the following
12432                                                              atomicrmw.
12433                                                            - Ensures that all
12434                                                              memory operations
12435                                                              have
12436                                                              completed before
12437                                                              performing the
12438                                                              atomicrmw that is
12439                                                              being released.
12440
12441                                                          2. buffer/global_atomic
12442                                                          3. s_waitcnt vm/vscnt(0)
12443
12444                                                            - If CU wavefront execution
12445                                                              mode, omit.
12446                                                            - Use vmcnt(0) if atomic with
12447                                                              return and vscnt(0) if
12448                                                              atomic with no-return.
12449                                                            - Must happen before
12450                                                              the following
12451                                                              buffer_gl0_inv.
12452                                                            - Ensures any
12453                                                              following global
12454                                                              data read is no
12455                                                              older than the
12456                                                              atomicrmw value
12457                                                              being acquired.
12458
12459                                                          4. buffer_gl0_inv
12460
12461                                                            - If CU wavefront execution
12462                                                              mode, omit.
12463                                                            - Ensures that
12464                                                              following
12465                                                              loads will not see
12466                                                              stale data.
12467
12468      atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
12469
12470                                                            - If CU wavefront execution
12471                                                              mode, omit.
12472                                                            - If OpenCL, omit.
12473                                                            - Could be split into
12474                                                              separate s_waitcnt
12475                                                              vmcnt(0) and s_waitcnt
12476                                                              vscnt(0) to allow
12477                                                              them to be
12478                                                              independently moved
12479                                                              according to the
12480                                                              following rules.
12481                                                            - s_waitcnt vmcnt(0)
12482                                                              must happen after
12483                                                              any preceding
12484                                                              global/generic load/load
12485                                                              atomic/
12486                                                              atomicrmw-with-return-value.
12487                                                            - s_waitcnt vscnt(0)
12488                                                              must happen after
12489                                                              any preceding
12490                                                              global/generic
12491                                                              store/store atomic/
12492                                                              atomicrmw-no-return-value.
12493                                                            - Must happen before
12494                                                              the following
12495                                                              atomicrmw.
12496                                                            - Ensures that all
12497                                                              global memory
12498                                                              operations have
12499                                                              completed before
12500                                                              performing the
12501                                                              atomicrmw that is
12502                                                              being released.
12503
12504                                                          2. ds_atomic
12505                                                          3. s_waitcnt lgkmcnt(0)
12506
12507                                                            - If OpenCL, omit.
12508                                                            - Must happen before
12509                                                              the following
12510                                                              buffer_gl0_inv.
12511                                                            - Ensures any
12512                                                              following global
12513                                                              data read is no
12514                                                              older than the local
12515                                                              atomicrmw value being
12516                                                              acquired.
12517
12518                                                          4. buffer_gl0_inv
12519
12520                                                            - If CU wavefront execution
12521                                                              mode, omit.
12522                                                            - If OpenCL, omit.
12523                                                            - Ensures that
12524                                                              following
12525                                                              loads will not see
12526                                                              stale data.
12527
12528      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
12529                                                             vmcnt(0) & vscnt(0)
12530
12531                                                            - If CU wavefront execution
12532                                                              mode, omit vmcnt(0) and
12533                                                              vscnt(0).
12534                                                            - If OpenCL, omit lgkmcnt(0).
12535                                                            - Could be split into
12536                                                              separate s_waitcnt
12537                                                              vmcnt(0), s_waitcnt
12538                                                              vscnt(0) and s_waitcnt
12539                                                              lgkmcnt(0) to allow
12540                                                              them to be
12541                                                              independently moved
12542                                                              according to the
12543                                                              following rules.
12544                                                            - s_waitcnt vmcnt(0)
12545                                                              must happen after
12546                                                              any preceding
12547                                                              global/generic load/load
12548                                                              atomic/
12549                                                              atomicrmw-with-return-value.
12550                                                            - s_waitcnt vscnt(0)
12551                                                              must happen after
12552                                                              any preceding
12553                                                              global/generic
12554                                                              store/store
12555                                                              atomic/
12556                                                              atomicrmw-no-return-value.
12557                                                            - s_waitcnt lgkmcnt(0)
12558                                                              must happen after
12559                                                              any preceding
12560                                                              local/generic
12561                                                              load/store/load
12562                                                              atomic/store
12563                                                              atomic/atomicrmw.
12564                                                            - Must happen before
12565                                                              the following
12566                                                              atomicrmw.
12567                                                            - Ensures that all
12568                                                              memory operations
12569                                                              have
12570                                                              completed before
12571                                                              performing the
12572                                                              atomicrmw that is
12573                                                              being released.
12574
12575                                                          2. flat_atomic
12576                                                          3. s_waitcnt lgkmcnt(0) &
12577                                                             vmcnt(0) & vscnt(0)
12578
12579                                                            - If CU wavefront execution
12580                                                              mode, omit vmcnt(0) and
12581                                                              vscnt(0).
12582                                                            - If OpenCL, omit lgkmcnt(0).
12583                                                            - Must happen before
12584                                                              the following
12585                                                              buffer_gl0_inv.
12586                                                            - Ensures any
12587                                                              following global
12588                                                              data read is no
12589                                                              older than the
12590                                                              atomicrmw value being
12591                                                              acquired.
12592
12593                                                          4. buffer_gl0_inv
12594
12595                                                            - If CU wavefront execution
12596                                                              mode, omit.
12597                                                            - Ensures that
12598                                                              following
12599                                                              loads will not see
12600                                                              stale data.
12601
12602      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
12603                                - system                     vmcnt(0) & vscnt(0)
12604
12605                                                            - If OpenCL, omit
12606                                                              lgkmcnt(0).
12607                                                            - Could be split into
12608                                                              separate s_waitcnt
12609                                                              vmcnt(0), s_waitcnt
12610                                                              vscnt(0) and s_waitcnt
12611                                                              lgkmcnt(0) to allow
12612                                                              them to be
12613                                                              independently moved
12614                                                              according to the
12615                                                              following rules.
12616                                                            - s_waitcnt vmcnt(0)
12617                                                              must happen after
12618                                                              any preceding
12619                                                              global/generic
12620                                                              load/load atomic/
12621                                                              atomicrmw-with-return-value.
12622                                                            - s_waitcnt vscnt(0)
12623                                                              must happen after
12624                                                              any preceding
12625                                                              global/generic
12626                                                              store/store atomic/
12627                                                              atomicrmw-no-return-value.
12628                                                            - s_waitcnt lgkmcnt(0)
12629                                                              must happen after
12630                                                              any preceding
12631                                                              local/generic
12632                                                              load/store/load
12633                                                              atomic/store
12634                                                              atomic/atomicrmw.
12635                                                            - Must happen before
12636                                                              the following
12637                                                              atomicrmw.
12638                                                            - Ensures that all
12639                                                              memory operations
12640                                                              to global have
12641                                                              completed before
12642                                                              performing the
12643                                                              atomicrmw that is
12644                                                              being released.
12645
12646                                                          2. buffer/global_atomic
12647                                                          3. s_waitcnt vm/vscnt(0)
12648
12649                                                            - Use vmcnt(0) if atomic with
12650                                                              return and vscnt(0) if
12651                                                              atomic with no-return.
12652                                                            - Must happen before
12653                                                              following
12654                                                              buffer_gl*_inv.
12655                                                            - Ensures the
12656                                                              atomicrmw has
12657                                                              completed before
12658                                                              invalidating the
12659                                                              caches.
12660
12661                                                          4. buffer_gl0_inv;
12662                                                             buffer_gl1_inv
12663
12664                                                            - Must happen before
12665                                                              any following
12666                                                              global/generic
12667                                                              load/load
12668                                                              atomic/atomicrmw.
12669                                                            - Ensures that
12670                                                              following loads
12671                                                              will not see stale
12672                                                              global data.
12673
12674      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
12675                                - system                     vmcnt(0) & vscnt(0)
12676
12677                                                            - If OpenCL, omit
12678                                                              lgkmcnt(0).
12679                                                            - Could be split into
12680                                                              separate s_waitcnt
12681                                                              vmcnt(0), s_waitcnt
12682                                                              vscnt(0), and s_waitcnt
12683                                                              lgkmcnt(0) to allow
12684                                                              them to be
12685                                                              independently moved
12686                                                              according to the
12687                                                              following rules.
12688                                                            - s_waitcnt vmcnt(0)
12689                                                              must happen after
12690                                                              any preceding
12691                                                              global/generic
12692                                                              load/load atomic/
12693                                                              atomicrmw-with-return-value.
12694                                                            - s_waitcnt vscnt(0)
12695                                                              must happen after
12696                                                              any preceding
12697                                                              global/generic
12698                                                              store/store atomic/
12699                                                              atomicrmw-no-return-value.
12700                                                            - s_waitcnt lgkmcnt(0)
12701                                                              must happen after
12702                                                              any preceding
12703                                                              local/generic
12704                                                              load/store/load
12705                                                              atomic/store
12706                                                              atomic/atomicrmw.
12707                                                            - Must happen before
12708                                                              the following
12709                                                              atomicrmw.
12710                                                            - Ensures that all
12711                                                              memory operations
12712                                                              have
12713                                                              completed before
12714                                                              performing the
12715                                                              atomicrmw that is
12716                                                              being released.
12717
12718                                                          2. flat_atomic
12719                                                          3. s_waitcnt vm/vscnt(0) &
12720                                                             lgkmcnt(0)
12721
12722                                                            - If OpenCL, omit
12723                                                              lgkmcnt(0).
12724                                                            - Use vmcnt(0) if atomic with
12725                                                              return and vscnt(0) if
12726                                                              atomic with no-return.
12727                                                            - Must happen before
12728                                                              following
12729                                                              buffer_gl*_inv.
12730                                                            - Ensures the
12731                                                              atomicrmw has
12732                                                              completed before
12733                                                              invalidating the
12734                                                              caches.
12735
12736                                                          4. buffer_gl0_inv;
12737                                                             buffer_gl1_inv
12738
12739                                                            - Must happen before
12740                                                              any following
12741                                                              global/generic
12742                                                              load/load
12743                                                              atomic/atomicrmw.
12744                                                            - Ensures that
12745                                                              following loads
12746                                                              will not see stale
12747                                                              global data.
12748
12749      fence        acq_rel      - singlethread *none*     *none*
12750                                - wavefront
12751      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
12752                                                             vmcnt(0) & vscnt(0)
12753
12754                                                            - If CU wavefront execution
12755                                                              mode, omit vmcnt(0) and
12756                                                              vscnt(0).
12757                                                            - If OpenCL and
12758                                                              address space is
12759                                                              not generic, omit
12760                                                              lgkmcnt(0).
12761                                                            - If OpenCL and
12762                                                              address space is
12763                                                              local, omit
12764                                                              vmcnt(0) and vscnt(0).
12765                                                            - However,
12766                                                              since LLVM
12767                                                              currently has no
12768                                                              address space on
12769                                                              the fence, it must
12770                                                              conservatively
12771                                                              always be generated
12772                                                              (see comment for
12773                                                              previous fence).
12774                                                            - Could be split into
12775                                                              separate s_waitcnt
12776                                                              vmcnt(0), s_waitcnt
12777                                                              vscnt(0) and s_waitcnt
12778                                                              lgkmcnt(0) to allow
12779                                                              them to be
12780                                                              independently moved
12781                                                              according to the
12782                                                              following rules.
12783                                                            - s_waitcnt vmcnt(0)
12784                                                              must happen after
12785                                                              any preceding
12786                                                              global/generic
12787                                                              load/load
12788                                                              atomic/
12789                                                              atomicrmw-with-return-value.
12790                                                            - s_waitcnt vscnt(0)
12791                                                              must happen after
12792                                                              any preceding
12793                                                              global/generic
12794                                                              store/store atomic/
12795                                                              atomicrmw-no-return-value.
12796                                                            - s_waitcnt lgkmcnt(0)
12797                                                              must happen after
12798                                                              any preceding
12799                                                              local/generic
12800                                                              load/store/load
12801                                                              atomic/store atomic/
12802                                                              atomicrmw.
12803                                                            - Must happen before
12804                                                              any following
12805                                                              global/generic
12806                                                              load/load
12807                                                              atomic/store/store
12808                                                              atomic/atomicrmw.
12809                                                            - Ensures that all
12810                                                              memory operations
12811                                                              have
12812                                                              completed before
12813                                                              performing any
12814                                                              following global
12815                                                              memory operations.
12816                                                            - Ensures that the
12817                                                              preceding
12818                                                              local/generic load
12819                                                              atomic/atomicrmw
12820                                                              with an equal or
12821                                                              wider sync scope
12822                                                              and memory ordering
12823                                                              stronger than
12824                                                              unordered (this is
12825                                                              termed the
12826                                                              acquire-fence-paired-atomic)
12827                                                              has completed
12828                                                              before following
12829                                                              global memory
12830                                                              operations. This
12831                                                              satisfies the
12832                                                              requirements of
12833                                                              acquire.
12834                                                            - Ensures that all
12835                                                              previous memory
12836                                                              operations have
12837                                                              completed before a
12838                                                              following
12839                                                              local/generic store
12840                                                              atomic/atomicrmw
12841                                                              with an equal or
12842                                                              wider sync scope
12843                                                              and memory ordering
12844                                                              stronger than
12845                                                              unordered (this is
12846                                                              termed the
12847                                                              release-fence-paired-atomic).
12848                                                              This satisfies the
12849                                                              requirements of
12850                                                              release.
12851                                                            - Must happen before
12852                                                              the following
12853                                                              buffer_gl0_inv.
12854                                                            - Ensures that the
12855                                                              acquire-fence-paired-atomic
12856                                                              has completed
12857                                                              before invalidating
12858                                                              the
12859                                                              cache. Therefore
12860                                                              any following
12861                                                              locations read must
12862                                                              be no older than
12863                                                              the value read by
12864                                                              the
12865                                                              acquire-fence-paired-atomic.
12866
12867                                                          3. buffer_gl0_inv
12868
12869                                                            - If CU wavefront execution
12870                                                              mode, omit.
12871                                                            - Ensures that
12872                                                              following
12873                                                              loads will not see
12874                                                              stale data.
12875
12876      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
12877                                - system                     vmcnt(0) & vscnt(0)
12878
12879                                                            - If OpenCL and
12880                                                              address space is
12881                                                              not generic, omit
12882                                                              lgkmcnt(0).
12883                                                            - If OpenCL and
12884                                                              address space is
12885                                                              local, omit
12886                                                              vmcnt(0) and vscnt(0).
12887                                                            - However, since LLVM
12888                                                              currently has no
12889                                                              address space on
12890                                                              the fence, the full
12891                                                              s_waitcnt must always
12892                                                              be generated
12893                                                              (see comment for
12894                                                              previous fence).
12895                                                            - Could be split into
12896                                                              separate s_waitcnt
12897                                                              vmcnt(0), s_waitcnt
12898                                                              vscnt(0) and s_waitcnt
12899                                                              lgkmcnt(0) to allow
12900                                                              them to be
12901                                                              independently moved
12902                                                              according to the
12903                                                              following rules.
12904                                                            - s_waitcnt vmcnt(0)
12905                                                              must happen after
12906                                                              any preceding
12907                                                              global/generic
12908                                                              load/load
12909                                                              atomic/
12910                                                              atomicrmw-with-return-value.
12911                                                            - s_waitcnt vscnt(0)
12912                                                              must happen after
12913                                                              any preceding
12914                                                              global/generic
12915                                                              store/store atomic/
12916                                                              atomicrmw-no-return-value.
12917                                                            - s_waitcnt lgkmcnt(0)
12918                                                              must happen after
12919                                                              any preceding
12920                                                              local/generic
12921                                                              load/store/load
12922                                                              atomic/store
12923                                                              atomic/atomicrmw.
12924                                                            - Must happen before
12925                                                              the following
12926                                                              buffer_gl*_inv.
12927                                                            - Ensures that the
12928                                                              preceding
12929                                                              global/local/generic
12930                                                              load
12931                                                              atomic/atomicrmw
12932                                                              with an equal or
12933                                                              wider sync scope
12934                                                              and memory ordering
12935                                                              stronger than
12936                                                              unordered (this is
12937                                                              termed the
12938                                                              acquire-fence-paired-atomic)
12939                                                              has completed
12940                                                              before invalidating
12941                                                              the caches. This
12942                                                              satisfies the
12943                                                              requirements of
12944                                                              acquire.
12945                                                            - Ensures that all
12946                                                              previous memory
12947                                                              operations have
12948                                                              completed before a
12949                                                              following
12950                                                              global/local/generic
12951                                                              store
12952                                                              atomic/atomicrmw
12953                                                              with an equal or
12954                                                              wider sync scope
12955                                                              and memory ordering
12956                                                              stronger than
12957                                                              unordered (this is
12958                                                              termed the
12959                                                              release-fence-paired-atomic).
12960                                                              This satisfies the
12961                                                              requirements of
12962                                                              release.
12963
12964                                                          2. buffer_gl0_inv;
12965                                                             buffer_gl1_inv
12966
12967                                                            - Must happen before
12968                                                              any following
12969                                                              global/generic
12970                                                              load/load
12971                                                              atomic/store/store
12972                                                              atomic/atomicrmw.
12973                                                            - Ensures that
12974                                                              following loads
12975                                                              will not see stale
12976                                                              global data. This
12977                                                              satisfies the
12978                                                              requirements of
12979                                                              acquire.
12980
12981      **Sequential Consistent Atomic**
12982      ------------------------------------------------------------------------------------
12983      load atomic  seq_cst      - singlethread - global   *Same as corresponding
12984                                - wavefront    - local    load atomic acquire,
12985                                               - generic  except must generate
12986                                                          all instructions even
12987                                                          for OpenCL.*
12988      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
12989                                               - generic     vmcnt(0) & vscnt(0)
12990
12991                                                            - If CU wavefront execution
12992                                                              mode, omit vmcnt(0) and
12993                                                              vscnt(0).
12994                                                            - Could be split into
12995                                                              separate s_waitcnt
12996                                                              vmcnt(0), s_waitcnt
12997                                                              vscnt(0), and s_waitcnt
12998                                                              lgkmcnt(0) to allow
12999                                                              them to be
13000                                                              independently moved
13001                                                              according to the
13002                                                              following rules.
13003                                                            - s_waitcnt lgkmcnt(0) must
13004                                                              happen after
13005                                                              preceding
13006                                                              local/generic load
13007                                                              atomic/store
13008                                                              atomic/atomicrmw
13009                                                              with memory
13010                                                              ordering of seq_cst
13011                                                              and with equal or
13012                                                              wider sync scope.
13013                                                              (Note that seq_cst
13014                                                              fences have their
13015                                                              own s_waitcnt
13016                                                              lgkmcnt(0) and so do
13017                                                              not need to be
13018                                                              considered.)
13019                                                            - s_waitcnt vmcnt(0)
13020                                                              must happen after
13021                                                              preceding
13022                                                              global/generic load
13023                                                              atomic/
13024                                                              atomicrmw-with-return-value
13025                                                              with memory
13026                                                              ordering of seq_cst
13027                                                              and with equal or
13028                                                              wider sync scope.
13029                                                              (Note that seq_cst
13030                                                              fences have their
13031                                                              own s_waitcnt
13032                                                              vmcnt(0) and so do
13033                                                              not need to be
13034                                                              considered.)
13035                                                            - s_waitcnt vscnt(0)
13036                                                              must happen after
13037                                                              preceding
13038                                                              global/generic store
13039                                                              atomic/
13040                                                              atomicrmw-no-return-value
13041                                                              with memory
13042                                                              ordering of seq_cst
13043                                                              and with equal or
13044                                                              wider sync scope.
13045                                                              (Note that seq_cst
13046                                                              fences have their
13047                                                              own s_waitcnt
13048                                                              vscnt(0) and so do
13049                                                              not need to be
13050                                                              considered.)
13051                                                            - Ensures any
13052                                                              preceding
13053                                                              sequentially
13054                                                              consistent global/local
13055                                                              memory instructions
13056                                                              have completed
13057                                                              before executing
13058                                                              this sequentially
13059                                                              consistent
13060                                                              instruction. This
13061                                                              prevents reordering
13062                                                              a seq_cst store
13063                                                              followed by a
13064                                                              seq_cst load. (Note
13065                                                              that seq_cst is
13066                                                              stronger than
13067                                                              acquire/release as
13068                                                              the reordering of
13069                                                              load acquire
13070                                                              followed by a store
13071                                                              release is
13072                                                              prevented by the
13073                                                              s_waitcnt of
13074                                                              the release, but
13075                                                              there is nothing
13076                                                              preventing a store
13077                                                              release followed by
13078                                                              load acquire from
13079                                                              completing out of
13080                                                              order. The s_waitcnt
13081                                                              could be placed after
13082                                                              the seq_cst store or
13083                                                              before the seq_cst
13084                                                              load. We choose the
13085                                                              load to make the
13086                                                              s_waitcnt as late as
13087                                                              possible so that the
13088                                                              store may have already
13089                                                              completed.)
13090
13091                                                          2. *Following
13092                                                             instructions same as
13093                                                             corresponding load
13094                                                             atomic acquire,
13095                                                             except must generate
13096                                                             all instructions even
13097                                                             for OpenCL.*
13098      load atomic  seq_cst      - workgroup    - local
13099
13100                                                          1. s_waitcnt vmcnt(0) & vscnt(0)
13101
13102                                                            - If CU wavefront execution
13103                                                              mode, omit.
13104                                                            - Could be split into
13105                                                              separate s_waitcnt
13106                                                              vmcnt(0) and s_waitcnt
13107                                                              vscnt(0) to allow
13108                                                              them to be
13109                                                              independently moved
13110                                                              according to the
13111                                                              following rules.
13112                                                            - s_waitcnt vmcnt(0)
13113                                                              must happen after
13114                                                              preceding
13115                                                              global/generic load
13116                                                              atomic/
13117                                                              atomicrmw-with-return-value
13118                                                              with memory
13119                                                              ordering of seq_cst
13120                                                              and with equal or
13121                                                              wider sync scope.
13122                                                              (Note that seq_cst
13123                                                              fences have their
13124                                                              own s_waitcnt
13125                                                              vmcnt(0) and so do
13126                                                              not need to be
13127                                                              considered.)
13128                                                            - s_waitcnt vscnt(0)
13129                                                              must happen after
13130                                                              preceding
13131                                                              global/generic store
13132                                                              atomic/
13133                                                              atomicrmw-no-return-value
13134                                                              with memory
13135                                                              ordering of seq_cst
13136                                                              and with equal or
13137                                                              wider sync scope.
13138                                                              (Note that seq_cst
13139                                                              fences have their
13140                                                              own s_waitcnt
13141                                                              vscnt(0) and so do
13142                                                              not need to be
13143                                                              considered.)
13144                                                            - Ensures any
13145                                                              preceding
13146                                                              sequentially
13147                                                              consistent global
13148                                                              memory instructions
13149                                                              have completed
13150                                                              before executing
13151                                                              this sequentially
13152                                                              consistent
13153                                                              instruction. This
13154                                                              prevents reordering
13155                                                              a seq_cst store
13156                                                              followed by a
13157                                                              seq_cst load. (Note
13158                                                              that seq_cst is
13159                                                              stronger than
13160                                                              acquire/release as
13161                                                              the reordering of
13162                                                              load acquire
13163                                                              followed by a store
13164                                                              release is
13165                                                              prevented by the
13166                                                              s_waitcnt of
13167                                                              the release, but
13168                                                              there is nothing
13169                                                              preventing a store
13170                                                              release followed by
13171                                                              load acquire from
13172                                                              completing out of
13173                                                              order. The s_waitcnt
13174                                                              could be placed after
13175                                                              the seq_cst store or
13176                                                              before the seq_cst
13177                                                              load. We choose the
13178                                                              load to make the
13179                                                              s_waitcnt as late as
13180                                                              possible so that the
13181                                                              store may have already
13182                                                              completed.)
13183
13184                                                          2. *Following
13185                                                             instructions same as
13186                                                             corresponding load
13187                                                             atomic acquire,
13188                                                             except must generate
13189                                                             all instructions even
13190                                                             for OpenCL.*
13191
13192      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
13193                                - system       - generic     vmcnt(0) & vscnt(0)
13194
13195                                                            - Could be split into
13196                                                              separate s_waitcnt
13197                                                              vmcnt(0), s_waitcnt
13198                                                              vscnt(0) and s_waitcnt
13199                                                              lgkmcnt(0) to allow
13200                                                              them to be
13201                                                              independently moved
13202                                                              according to the
13203                                                              following rules.
13204                                                            - s_waitcnt lgkmcnt(0)
13205                                                              must happen after
13206                                                              preceding
13207                                                              local load
13208                                                              atomic/store
13209                                                              atomic/atomicrmw
13210                                                              with memory
13211                                                              ordering of seq_cst
13212                                                              and with equal or
13213                                                              wider sync scope.
13214                                                              (Note that seq_cst
13215                                                              fences have their
13216                                                              own s_waitcnt
13217                                                              lgkmcnt(0) and so do
13218                                                              not need to be
13219                                                              considered.)
13220                                                            - s_waitcnt vmcnt(0)
13221                                                              must happen after
13222                                                              preceding
13223                                                              global/generic load
13224                                                              atomic/
13225                                                              atomicrmw-with-return-value
13226                                                              with memory
13227                                                              ordering of seq_cst
13228                                                              and with equal or
13229                                                              wider sync scope.
13230                                                              (Note that seq_cst
13231                                                              fences have their
13232                                                              own s_waitcnt
13233                                                              vmcnt(0) and so do
13234                                                              not need to be
13235                                                              considered.)
13236                                                            - s_waitcnt vscnt(0)
13237                                                              must happen after
13238                                                              preceding
13239                                                              global/generic store
13240                                                              atomic/
13241                                                              atomicrmw-no-return-value
13242                                                              with memory
13243                                                              ordering of seq_cst
13244                                                              and with equal or
13245                                                              wider sync scope.
13246                                                              (Note that seq_cst
13247                                                              fences have their
13248                                                              own s_waitcnt
13249                                                              vscnt(0) and so do
13250                                                              not need to be
13251                                                              considered.)
13252                                                            - Ensures any
13253                                                              preceding
13254                                                              sequentially
13255                                                              consistent global
13256                                                              memory instructions
13257                                                              have completed
13258                                                              before executing
13259                                                              this sequentially
13260                                                              consistent
13261                                                              instruction. This
13262                                                              prevents reordering
13263                                                              a seq_cst store
13264                                                              followed by a
13265                                                              seq_cst load. (Note
13266                                                              that seq_cst is
13267                                                              stronger than
13268                                                              acquire/release as
13269                                                              the reordering of
13270                                                              load acquire
13271                                                              followed by a store
13272                                                              release is
13273                                                              prevented by the
13274                                                              s_waitcnt of
13275                                                              the release, but
13276                                                              there is nothing
13277                                                              preventing a store
13278                                                              release followed by
13279                                                              load acquire from
13280                                                              completing out of
13281                                                              order. The s_waitcnt
13282                                                              could be placed after
13283                                                              the seq_cst store or before
13284                                                              the seq_cst load. We
13285                                                              choose the load to
13286                                                              make the s_waitcnt be
13287                                                              as late as possible
13288                                                              so that the store
13289                                                              may have already
13290                                                              completed.)
13291
13292                                                          2. *Following
13293                                                             instructions same as
13294                                                             corresponding load
13295                                                             atomic acquire,
13296                                                             except must generate
13297                                                             all instructions even
13298                                                             for OpenCL.*
13299      store atomic seq_cst      - singlethread - global   *Same as corresponding
13300                                - wavefront    - local    store atomic release,
13301                                - workgroup    - generic  except must generate
13302                                - agent                   all instructions even
13303                                - system                  for OpenCL.*
13304      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
13305                                - wavefront    - local    atomicrmw acq_rel,
13306                                - workgroup    - generic  except must generate
13307                                - agent                   all instructions even
13308                                - system                  for OpenCL.*
13309      fence        seq_cst      - singlethread *none*     *Same as corresponding
13310                                - wavefront               fence acq_rel,
13311                                - workgroup               except must generate
13312                                - agent                   all instructions even
13313                                - system                  for OpenCL.*
13314      ============ ============ ============== ========== ================================
13315
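      The seq_cst rows above exist to forbid one specific reordering: a seq_cst
      store followed by a seq_cst load from a different location. As a rough
      host-side analogy only (this is ordinary C++ ``std::atomic`` code, not AMDGPU
      code, and the names ``Outcome`` and ``runStoreBufferingOnce`` are hypothetical),
      the store-buffering test below shows the outcome that sequential consistency
      rules out.

      .. code-block:: c++

         #include <atomic>
         #include <thread>

         // Result of one run: the value each thread loaded.
         struct Outcome {
           int r0;
           int r1;
         };

         // Each thread does a seq_cst store to one variable followed by a seq_cst
         // load of the other. If the store and the load could complete out of
         // order, both loads could observe 0. With seq_cst on all four accesses
         // that outcome is not permitted.
         Outcome runStoreBufferingOnce() {
           std::atomic<int> x{0};
           std::atomic<int> y{0};
           int r0 = -1;
           int r1 = -1;
           std::thread t0([&] {
             x.store(1, std::memory_order_seq_cst);
             r0 = y.load(std::memory_order_seq_cst);
           });
           std::thread t1([&] {
             y.store(1, std::memory_order_seq_cst);
             r1 = x.load(std::memory_order_seq_cst);
           });
           t0.join();
           t1.join();
           return {r0, r1};
         }

      With release stores and acquire loads instead, the result ``r0 == 0`` and
      ``r1 == 0`` would be allowed, which is the gap that the extra ``s_waitcnt``
      before the seq_cst load is described as closing.
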
13316 .. _amdgpu-amdhsa-trap-handler-abi:
13317
13318 Trap Handler ABI
13319 ~~~~~~~~~~~~~~~~
13320
13321 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13322 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13323 supports the ``s_trap`` instruction. For usage see:
13324
13325 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13326 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13327 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13328
13329   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13330      :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13331
13332      =================== =============== =============== =======================================
13333      Usage               Code Sequence   Trap Handler    Description
13334                                          Inputs
13335      =================== =============== =============== =======================================
13336      reserved            ``s_trap 0x00``                 Reserved by hardware.
13337      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
13338                                            ``queue_ptr`` intrinsic (not implemented).
13339                                          ``VGPR0``:
13340                                            ``arg``
13341      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13342                                            ``queue_ptr`` the trap instruction. The associated
13343                                                          queue is signalled to put it into the
13344                                                          error state.  When the queue is put in
13345                                                          the error state, the waves executing
13346                                                          dispatches on the queue will be
13347                                                          terminated.
13348      ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
13349                                                            as a no-operation. The trap handler
13350                                                            is entered and immediately returns to
13351                                                            continue execution of the wavefront.
13352                                                          - If the debugger is enabled, causes
13353                                                            the debug trap to be reported by the
13354                                                            debugger and the wavefront is put in
13355                                                            the halt state with the PC at the
13356                                                            instruction.  The debugger must
13357                                                            increment the PC and resume the wave.
13358      reserved            ``s_trap 0x04``                 Reserved.
13359      reserved            ``s_trap 0x05``                 Reserved.
13360      reserved            ``s_trap 0x06``                 Reserved.
13361      reserved            ``s_trap 0x07``                 Reserved.
13362      reserved            ``s_trap 0x08``                 Reserved.
13363      reserved            ``s_trap 0xfe``                 Reserved.
13364      reserved            ``s_trap 0xff``                 Reserved.
13365      =================== =============== =============== =======================================
13366
13367 ..
13368
13369   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13370      :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13371
13372      =================== =============== =============== =======================================
13373      Usage               Code Sequence   Trap Handler    Description
13374                                          Inputs
13375      =================== =============== =============== =======================================
13376      reserved            ``s_trap 0x00``                 Reserved by hardware.
13377      debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
13378                                                          breakpoints. Causes wave to be halted
13379                                                          with the PC at the trap instruction.
13380                                                          The debugger is responsible for resuming
13381                                                          the wave, including the instruction
13382                                                          that the breakpoint overwrote.
13383      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
13384                                            ``queue_ptr`` the trap instruction. The associated
13385                                                          queue is signalled to put it into the
13386                                                          error state.  When the queue is put in
13387                                                          the error state, the waves executing
13388                                                          dispatches on the queue will be
13389                                                          terminated.
13390      ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
13391                                                            as a no-operation. The trap handler
13392                                                            is entered and immediately returns to
13393                                                            continue execution of the wavefront.
13394                                                          - If the debugger is enabled, causes
13395                                                            the debug trap to be reported by the
13396                                                            debugger and the wavefront is put in
13397                                                            the halt state with the PC at the
13398                                                            instruction.  The debugger must
13399                                                            increment the PC and resume the wave.
13400      reserved            ``s_trap 0x04``                 Reserved.
13401      reserved            ``s_trap 0x05``                 Reserved.
13402      reserved            ``s_trap 0x06``                 Reserved.
13403      reserved            ``s_trap 0x07``                 Reserved.
13404      reserved            ``s_trap 0x08``                 Reserved.
13405      reserved            ``s_trap 0xfe``                 Reserved.
13406      reserved            ``s_trap 0xff``                 Reserved.
13407      =================== =============== =============== =======================================
13408
13409 ..
13410
13411   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13412      :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13413
13414      =================== =============== ================ ================= =======================================
13415      Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13416      =================== =============== ================ ================= =======================================
13417      reserved            ``s_trap 0x00``                                    Reserved by hardware.
13418      debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
13419                                                                             breakpoints. Causes wave to be halted
13420                                                                             with the PC at the trap instruction.
13421                                                                             The debugger is responsible for resuming
13422                                                                             the wave, including the instruction
13423                                                                             that the breakpoint overwrote.
13424      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
13425                                            ``queue_ptr``                    the trap instruction. The associated
13426                                                                             queue is signalled to put it into the
13427                                                                             error state.  When the queue is put in
13428                                                                             the error state, the waves executing
13429                                                                             dispatches on the queue will be
13430                                                                             terminated.
13431      ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled, behaves
13432                                                                               as a no-operation. The trap handler
13433                                                                               is entered and immediately returns to
13434                                                                               continue execution of the wavefront.
13435                                                                             - If the debugger is enabled, causes
13436                                                                               the debug trap to be reported by the
13437                                                                               debugger and the wavefront is put in
13438                                                                               the halt state with the PC at the
13439                                                                               instruction.  The debugger must
13440                                                                               increment the PC and resume the wave.
13441      reserved            ``s_trap 0x04``                                    Reserved.
13442      reserved            ``s_trap 0x05``                                    Reserved.
13443      reserved            ``s_trap 0x06``                                    Reserved.
13444      reserved            ``s_trap 0x07``                                    Reserved.
13445      reserved            ``s_trap 0x08``                                    Reserved.
13446      reserved            ``s_trap 0xfe``                                    Reserved.
13447      reserved            ``s_trap 0xff``                                    Reserved.
13448      =================== =============== ================ ================= =======================================
13449
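      The three tables differ mainly in what ``llvm.trap`` (``s_trap 0x02``) expects
      as trap handler input. The following sketch restates that difference as a
      lookup. The enum and function names are hypothetical and are not part of
      LLVM. It only encodes what the tables above state.

      .. code-block:: c++

         // Trap handler input expected by llvm.trap (s_trap 0x02).
         enum class TrapInput { None, QueuePtrInSGPR0_1 };

         // codeObjectVersion: 2, 3, 4, ...; gfxMajor: 6 for GFX6, 9 for GFX9, etc.
         constexpr TrapInput llvmTrapInput(unsigned codeObjectVersion,
                                           unsigned gfxMajor) {
           // Code object V2 and V3 tables: queue_ptr is passed in SGPR0-1.
           if (codeObjectVersion < 4)
             return TrapInput::QueuePtrInSGPR0_1;
           // Code object V4 and above: GFX6-GFX8 still pass queue_ptr in SGPR0-1,
           // while GFX9-GFX11 take no inputs.
           return gfxMajor <= 8 ? TrapInput::QueuePtrInSGPR0_1 : TrapInput::None;
         }

      For example, ``llvmTrapInput(4, 9)`` yields ``TrapInput::None``, whereas
      ``llvmTrapInput(3, 9)`` and ``llvmTrapInput(4, 8)`` both yield
      ``TrapInput::QueuePtrInSGPR0_1``.
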
13450 .. _amdgpu-amdhsa-function-call-convention:
13451
13452 Call Convention
13453 ~~~~~~~~~~~~~~~
13454
13455 .. note::
13456
13457   This section is currently incomplete and has inaccuracies. It is a work in
13458   progress that will be updated as information is determined.
13459
13460 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13461 addresses. Unswizzled addresses are normal linear addresses.
13462
13463 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
13464
13465 Kernel Functions
13466 ++++++++++++++++
13467
13468 This section describes the call convention ABI for the outer kernel function.
13469
13470 See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
13471 convention.
13472
13473 The following is not part of the AMDGPU kernel calling convention but describes
13474 how the AMDGPU backend implements function calls:
13475
13476 1.  Clang decides the kernarg layout to match the *HSA Programmer's Language
13477     Reference* [HSA]_.
13478
13479     - All structs are passed directly.
13480     - Lambda values are passed *TBA*.
13481
13482     .. TODO::
13483
13484       - Does this really follow HSA rules? Or are structs >16 bytes passed
13485         by-value struct?
13486       - What is ABI for lambda values?
13487
13488 2.  The kernel performs certain setup in its prolog, as described in
13489     :ref:`amdgpu-amdhsa-kernel-prolog`.
13490
13491 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13492
13493 Non-Kernel Functions
13494 ++++++++++++++++++++
13495
13496 This section describes the call convention ABI for functions other than the
13497 outer kernel function.
13498
13499 If a kernel has function calls, scratch is always allocated and used for the
13500 call stack, which grows from low address to high address using the swizzled
13501 scratch address space.
13502
13503 On entry to a function:
13504
13505 1.  SGPR0-3 contain a V# with the following properties (see
13506     :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
13507
13508     * Base address pointing to the beginning of the wavefront scratch backing
13509       memory.
13510     * Swizzled with dword element size and stride of wavefront size elements.
13511
13512 2.  The FLAT_SCRATCH register pair is set up. See
13513     :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
13514 3.  GFX6-GFX8: M0 register is set to the size of LDS in bytes. See
13515     :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
13516 4.  The EXEC register is set to the lanes active on entry to the function.
13517 5.  MODE register: *TBD*
13518 6.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
13519     below.
13520 7.  SGPR30-31 contain the return address (RA), the code address that the
13521     function must return to when it completes. The value is undefined if the
13522     function is *no return*.
13523 8.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13524     offset relative to the beginning of the wavefront scratch backing memory.
13525
13526     The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
13527     offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
13528     manner.
13529
13530     The unswizzled SP value can be converted into the swizzled SP value by:
13531
13532       | swizzled SP = unswizzled SP / wavefront size
13533
13534     This may be used to obtain the private address space address of stack
13535     objects and to convert this address to a flat address by adding the flat
13536     scratch aperture base address.
13537
13538     The swizzled SP value is always 4 byte aligned for the ``r600``
13539     architecture and 16 byte aligned for the ``amdgcn`` architecture.
13540
13541     .. note::
13542
13543       The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13544       OpenCL language which has the largest base type defined as 16 bytes.
13545
13546     On entry, the swizzled SP value is the address of the first function
13547     argument passed on the stack. Other stack passed arguments are positive
13548     offsets from the entry swizzled SP value.
13549
13550     The function may use positive offsets beyond the last stack passed argument
13551     for stack allocated local variables and register spill slots. If necessary,
13552     the function may align these to greater alignment than 16 bytes. After these
13553     the function may dynamically allocate space for such things as runtime sized
13554     ``alloca`` local allocations.
13555
13556     If the function calls another function, it will place any stack allocated
13557     arguments after the last local allocation and adjust SGPR32 to the address
13558     after the last local allocation.
13559
13560 9.  All other registers are unspecified.
13561 10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
13562     to the function.
13563
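      The stack pointer arithmetic in item 8 can be sketched as follows. This is an
      illustration only: the function names are hypothetical, and the aperture base
      used in the example is an arbitrary value rather than one taken from real
      hardware.

      .. code-block:: c++

         #include <cstdint>

         // Item 8: swizzled SP = unswizzled SP / wavefront size.
         constexpr std::uint32_t swizzledSP(std::uint32_t unswizzledSP,
                                            std::uint32_t wavefrontSize) {
           return unswizzledSP / wavefrontSize;
         }

         // Flat address of a stack object: its private (swizzled) address plus
         // the flat scratch aperture base address, supplied here by the caller.
         constexpr std::uint64_t flatStackAddress(
             std::uint32_t swizzledAddress, std::uint64_t flatScratchApertureBase) {
           return flatScratchApertureBase + swizzledAddress;
         }

         // For the amdgcn architecture the swizzled SP is 16 byte aligned.
         constexpr bool isAmdgcnAligned(std::uint32_t swizzledSPValue) {
           return swizzledSPValue % 16 == 0;
         }

      For a wavefront size of 64, an unswizzled SP of 1024 corresponds to a swizzled
      SP of 16, which satisfies the 16 byte alignment stated for ``amdgcn``.
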
13564 On exit from a function:
13565
13566 1.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13567     described below. Any registers used are considered clobbered registers.
13568 2.  The following registers are preserved and have the same value as on entry:
13569
13570     * FLAT_SCRATCH
13571     * EXEC
13572     * GFX6-GFX8: M0
13573     * All SGPR registers except the clobbered registers of SGPR4-31.
13574     * VGPR40-47
13575     * VGPR56-63
13576     * VGPR72-79
13577     * VGPR88-95
13578     * VGPR104-111
13579     * VGPR120-127
13580     * VGPR136-143
13581     * VGPR152-159
13582     * VGPR168-175
13583     * VGPR184-191
13584     * VGPR200-207
13585     * VGPR216-223
13586     * VGPR232-239
13587     * VGPR248-255
13588
13589         .. note::
13590
13591           Except the argument registers, the VGPRs clobbered and the preserved
13592           registers are intermixed at regular intervals in order to keep a
13593           similar ratio independent of the number of allocated VGPRs.
13594
13595     * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13596     * Lanes of all VGPRs that are inactive at the call site.
13597
13598       For the AMDGPU backend, an inter-procedural register allocation (IPRA)
13599       optimization may mark some of the clobbered SGPR and VGPR registers as
13600       preserved if it can be determined that the called function does not change
13601       their value.
13602
13603 3.  The PC is set to the RA provided on entry.
13604 4.  MODE register: *TBD*.
13605 5.  All other registers are clobbered.
13606 6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
13607     the function is available to the caller.
13608
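      The preserved VGPR ranges listed above follow a regular pattern: starting at
      VGPR40, every block of 16 registers has its first 8 preserved. A minimal
      sketch of that pattern is shown below. The function name is hypothetical, and
      the sketch ignores the IPRA optimization and the inactive-lane rule.

      .. code-block:: c++

         // Returns true if VGPR<n> falls in one of the preserved ranges
         // VGPR40-47, VGPR56-63, ..., VGPR248-255 listed above.
         constexpr bool isPreservedVGPR(unsigned n) {
           return n >= 40 && n <= 255 && (n - 40) % 16 < 8;
         }

      Under this rule VGPR0-39 are never preserved, and between VGPR40 and VGPR255
      preserved and clobbered groups of 8 registers alternate, ending with the
      preserved group VGPR248-255.
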
13609 .. TODO::
13610
13611   - How are function results returned? The address of structured types is passed
13612     by reference, but what about other types?
13613
13614 The function input arguments are made up of the formal arguments explicitly
13615 declared by the source language function plus the implicit input arguments used
13616 by the implementation.
13617
13618 The source language input arguments are:
13619
13620 1. Any source language implicit ``this`` or ``self`` argument comes first as a
13621    pointer type.
13622 2. Followed by the function formal arguments in left to right source order.
13623
13624 The source language result arguments are:
13625
13626 1. The function result argument.
13627
13628 The source language input or result struct type arguments that are less than or
13629 equal to 16 bytes are decomposed recursively into their base type fields, and
13630 each field is passed as if it were a separate argument. For input arguments, if
13631 the called function requires the struct to be in memory, for example because
13632 its address is taken, then the function body is responsible for allocating a
13633 stack location and copying the field arguments into it. Clang terms this
13634 *direct struct*.
13635
13636 The source language input struct type arguments that are greater than 16 bytes
13637 are passed by reference. The caller is responsible for allocating a stack
13638 location, copying the struct value into it, and passing its address as the input
13639 argument. The called function is responsible for dereferencing it when accessing
13640 the input argument. Clang terms this *by-value struct*.
13641
13642 A source language result struct type argument that is greater than 16 bytes is
13643 returned by reference. The caller is responsible for allocating a stack location
13644 to hold the result value and passes the address as the last input argument
13645 (before the implicit input arguments). In this case there are no result
13646 arguments. The called function is responsible for performing the dereference
13647 when storing the result value. Clang terms this *structured return (sret)*.
13648
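      The 16 byte threshold described above can be sketched as a size check. This
      is not Clang's implementation: the names below are hypothetical, and the rule
      itself is flagged in this section as possibly inaccurate.

      .. code-block:: c++

         #include <cstddef>

         // How a struct argument or result is passed according to the text above.
         enum class StructPassing {
           Direct,      // <= 16 bytes: decomposed into fields, each passed separately
           ByReference  // > 16 bytes: caller allocates stack memory, passes address
         };

         constexpr StructPassing classifyStruct(std::size_t sizeInBytes) {
           return sizeInBytes <= 16 ? StructPassing::Direct
                                    : StructPassing::ByReference;
         }

         // Example types; sizes assume a 4 byte float.
         struct Small { float x, y, z, w; };  // 16 bytes
         struct Large { float m[5]; };        // 20 bytes

      Here ``classifyStruct(sizeof(Small))`` gives ``StructPassing::Direct`` and
      ``classifyStruct(sizeof(Large))`` gives ``StructPassing::ByReference``. For a
      result of type ``Large``, the by-reference case corresponds to what the text
      calls *structured return (sret)*.
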
13649 *TODO: correct the ``sret`` definition.*
13650
13651 .. TODO::
13652
13653   Is this definition correct? Or is ``sret`` only used if passing in registers, and
13654   pass as non-decomposed struct as stack argument? Or something else? Is the
13655   memory location in the caller stack frame, or a stack memory argument and so
13656   no address is passed as the caller can directly write to the argument stack
13657   location? But then the stack location is still live after return. If an
13658   argument stack location is it the first stack argument or the last one?
13659
13660 Lambda argument types are treated as struct types with an implementation defined
13661 set of fields.
13662
13663 .. TODO::
13664
13665   Need to specify the ABI for lambda types for AMDGPU.
13666
13667 For the AMDGPU backend, all source language arguments (including the decomposed
13668 struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
13669 case they are passed in SGPRs.
13670
13671 The AMDGPU backend walks the function call graph from the leaves to determine
13672 which implicit input arguments are used, propagating to each caller of the
13673 function. The used implicit arguments are appended to the function arguments
13674 after the source language arguments in the following order:
13675
13676 .. TODO::
13677
13678   Is recursion or external functions supported?
13679
13680 1.  Work-Item ID (1 VGPR)
13681
13682     The X, Y and Z work-item IDs are packed into a single VGPR with the following
13683     layout. Only fields actually used by the function are set. The other bits
13684     are undefined.
13685
13686     The values come from the initial kernel execution state. See
13687     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13688
13689     .. table:: Work-item implicit argument layout
13690       :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
13691
13692       ======= ======= ==============
13693       Bits    Size    Field Name
13694       ======= ======= ==============
13695       9:0     10 bits X Work-Item ID
13696       19:10   10 bits Y Work-Item ID
13697       29:20   10 bits Z Work-Item ID
13698       31:30   2 bits  Unused
13699       ======= ======= ==============
13700
13701 2.  Dispatch Ptr (2 SGPRs)
13702
13703     The value comes from the initial kernel execution state. See
13704     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13705
13706 3.  Queue Ptr (2 SGPRs)
13707
13708     The value comes from the initial kernel execution state. See
13709     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13710
13711 4.  Kernarg Segment Ptr (2 SGPRs)
13712
13713     The value comes from the initial kernel execution state. See
13714     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13715
13716 5.  Dispatch id (2 SGPRs)
13717
13718     The value comes from the initial kernel execution state. See
13719     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13720
13721 6.  Work-Group ID X (1 SGPR)
13722
13723     The value comes from the initial kernel execution state. See
13724     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13725
13726 7.  Work-Group ID Y (1 SGPR)
13727
13728     The value comes from the initial kernel execution state. See
13729     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13730
13731 8.  Work-Group ID Z (1 SGPR)
13732
13733     The value comes from the initial kernel execution state. See
13734     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13735
13736 9.  Implicit Argument Ptr (2 SGPRs)
13737
13738     The value is computed by adding an offset to Kernarg Segment Ptr to get the
13739     global address space pointer to the first kernarg implicit argument.
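
The packed work-item ID layout described in item 1 can be illustrated with a
short sketch. The Python below only mirrors the bit positions given in the
work-item implicit argument layout table. The names ``pack_work_item_id`` and
``unpack_work_item_id`` are hypothetical and are not part of LLVM or of any AMD
runtime.

.. code-block:: python

  WORK_ITEM_ID_BITS = 10  # X, Y and Z each occupy a 10-bit field
  WORK_ITEM_ID_MASK = (1 << WORK_ITEM_ID_BITS) - 1


  def pack_work_item_id(x, y, z):
      """Pack X, Y and Z work-item IDs into one 32-bit value (hypothetical helper)."""
      for value in (x, y, z):
          if not 0 <= value <= WORK_ITEM_ID_MASK:
              raise ValueError("work-item ID does not fit in 10 bits")
      # Bits 9:0 hold X, bits 19:10 hold Y, bits 29:20 hold Z.
      return x | (y << 10) | (z << 20)


  def unpack_work_item_id(vgpr):
      """Extract (X, Y, Z) from a packed value, ignoring bits 31:30."""
      x = vgpr & WORK_ITEM_ID_MASK
      y = (vgpr >> 10) & WORK_ITEM_ID_MASK
      z = (vgpr >> 20) & WORK_ITEM_ID_MASK
      return (x, y, z)

Because only the fields actually used by a function are set, a real consumer
should read only the fields it knows to be valid. The sketch extracts all three
purely for illustration.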
13740
13741 The input and result arguments are assigned in order in the following manner:
13742
13743 .. note::
13744
13745   There are likely some errors and omissions in the following description that
13746   need correction.
13747
13748   .. TODO::
13749
13750     Check the Clang source code to decipher how function arguments and return
13751     results are handled. Also see the AMDGPU specific values used.
13752
13753 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
13754   VGPR31.
13755
13756   If there are more arguments than will fit in these registers, the remaining
13757   arguments are allocated on the stack in order on naturally aligned
13758   addresses.
13759
13760   .. TODO::
13761
13762     How are overly aligned structures allocated on the stack?
13763
13764 * SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
13765   SGPR29.
13766
13767   If there are more arguments than will fit in these registers, the remaining
13768   arguments are allocated on the stack in order on naturally aligned
13769   addresses.
13770
13771 Note that decomposed struct type arguments may have some fields passed in
13772 registers and some in memory.
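
The assignment rules above can be sketched as follows. This is a simplified
model, not the backend's implementation. It assumes every argument element is
a single 32-bit value, that ``inreg`` elements go to SGPRs and all others to
VGPRs, and that overflow from either register class shares one stack area in
argument order. The name ``assign_arguments`` and the location strings are
hypothetical. The sketch follows the convention as described here, although the
implementation notes in this section state that the backend does not support
allocating SGPR arguments on the stack.

.. code-block:: python

  NUM_ARG_VGPRS = 32  # VGPR0 up to VGPR31
  NUM_ARG_SGPRS = 30  # SGPR0 up to SGPR29
  DWORD_BYTES = 4     # assumption: every element is one 32-bit value


  def assign_arguments(args):
      """Map each (name, inreg) argument element to a register or stack offset."""
      next_vgpr = 0
      next_sgpr = 0
      stack_offset = 0
      locations = {}
      for name, inreg in args:
          if inreg and next_sgpr < NUM_ARG_SGPRS:
              locations[name] = f"s{next_sgpr}"
              next_sgpr += 1
          elif not inreg and next_vgpr < NUM_ARG_VGPRS:
              locations[name] = f"v{next_vgpr}"
              next_vgpr += 1
          else:
              # Registers of the required class are exhausted: use the next
              # naturally aligned (4-byte) stack slot.
              locations[name] = f"stack+{stack_offset}"
              stack_offset += DWORD_BYTES
      return locations

Modelling a decomposed struct as a run of consecutive elements shows how some
of its fields can land in registers while the remaining fields land on the
stack.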
13773
13774 .. TODO::
13775
13776   So, a struct which can pass some fields as decomposed register arguments, will
13777   pass the rest as decomposed stack elements? But an argument that will not start
13778   in registers will not be decomposed and will be passed as a non-decomposed
13779   stack value?
13780
13781 The following is not part of the AMDGPU function calling convention but
13782 describes how the AMDGPU backend implements function calls:
13783
13784 1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
13785     unswizzled scratch address. It is only needed if runtime-sized ``alloca``
13786     instructions are used, or for the reasons defined in ``SIFrameLowering``.
13787 2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
13788     to access the incoming stack arguments in the function. The BP is needed
13789     only when the function requires runtime stack alignment.
13790
13791 3.  Allocating SGPR arguments on the stack is not supported.
13792
13793 4.  No CFI is currently generated. See
13794     :ref:`amdgpu-dwarf-call-frame-information`.
13795
13796     .. note::
13797
13798       CFI will be generated that defines the CFA as the unswizzled address
13799       relative to the wave scratch base in the unswizzled private address space
13800       of the lowest address stack allocated local variable.
13801
13802       ``DW_AT_frame_base`` will be defined as the swizzled address in the
13803       swizzled private address space by dividing the CFA by the wavefront size
13804       (since CFA is always at least dword aligned which matches the scratch
13805       swizzle element size).
13806
13807       If no dynamic stack alignment was performed, the stack allocated arguments
13808       are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
13809       local variables and register spill slots are accessed as positive offsets
13810       relative to ``DW_AT_frame_base``.
13811
13812 5.  Function argument passing is implemented by copying the input physical
13813     registers to virtual registers on entry. The register allocator can spill if
13814     necessary. These are copied back to physical registers at call sites. The
13815     net effect is that each function call can have these values in entirely
13816     distinct locations. The IPRA can help avoid shuffling argument registers.
13817 6.  Call sites are implemented by setting up the arguments at positive offsets
13818     from SP. Then SP is incremented to account for the known frame size before
13819     the call and decremented after the call.
13820
13821     .. note::
13822
13823       The CFI will reflect the changed calculation needed to compute the CFA
13824       from SP.
13825
13826 7.  4 byte spill slots are used in the stack frame. One slot is allocated for an
13827     emergency spill slot. Buffer instructions are used for stack accesses and
13828     not the ``flat_scratch`` instruction.
13829
13830     .. TODO::
13831
13832       Explain when the emergency spill slot is used.
13833
13834 .. TODO::
13835
13836   Possible broken issues:
13837
13838   - Stack arguments must be aligned to required alignment.
13839   - Stack is aligned to max(16, max formal argument alignment)
13840   - Direct argument < 64 bits should check register budget.
13841   - Register budget calculation should respect ``inreg`` for SGPR.
13842   - SGPR overflow is not handled.
13843   - struct with 1 member unpeeling is not checking size of member.
13844   - ``sret`` is after ``this`` pointer.
13845   - Caller is not implementing stack realignment: need an extra pointer.
13846   - Should say AMDGPU passes FP rather than SP.
13847   - Should CFI define CFA as address of locals or arguments. Difference is
13848     apparent when have implemented dynamic alignment.
13849   - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
13850     highest address of stack frame and use negative offset for locals. Would
13851     allow SP to be the same as FP and could support signal-handler-like as now
13852     have a real SP for the top of the stack.
13853   - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
13854     arguments?
13855
13856 AMDPAL
13857 ------
13858
13859 This section provides code conventions used when the target triple OS is
13860 ``amdpal`` (see :ref:`amdgpu-target-triples`).
13861
13862 .. _amdgpu-amdpal-code-object-metadata-section:
13863
13864 Code Object Metadata
13865 ~~~~~~~~~~~~~~~~~~~~
13866
13867 .. note::
13868
13869   The metadata is currently in development and is subject to major
13870   changes. Only the current version is supported. *When this document
13871   was generated the version was 2.6.*
13872
13873 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
13874 record (see :ref:`amdgpu-note-records-v3-onwards`).
13875
13876 The metadata is represented as Message Pack formatted binary data (see
13877 [MsgPack]_). The top level is a Message Pack map that includes the keys
13878 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
13879 and referenced tables.
13880
13881 Additional information can be added to the maps. To avoid conflicts, any
13882 key names should be prefixed by "*vendor-name*." where ``vendor-name``
13883 can be the name of the vendor and specific vendor tool that generates the
13884 information. The prefix is abbreviated to simply "." when it appears
13885 within a map that has been added by the same *vendor-name*.
13886
13887   .. table:: AMDPAL Code Object Metadata Map
13888      :name: amdgpu-amdpal-code-object-metadata-map-table
13889
13890      =================== ============== ========= ======================================================================
13891      String Key          Value Type     Required? Description
13892      =================== ============== ========= ======================================================================
13893      "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
13894                          2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
13895      "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
13896                          map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
13897                                                   definition of the keys included in that map.
13898      =================== ============== ========= ======================================================================
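
As an illustration of the map structure, the sketch below shows what the
decoded metadata might look like as plain Python data, together with a small
check for the keys marked as required in the tables of this section. It works
on already-decoded data and does not perform any Message Pack encoding or
decoding. The function name ``check_amdpal_metadata``, the pipeline name and
the hash values are hypothetical placeholders. The version ``[2, 6]`` is taken
from the note at the start of this section.

.. code-block:: python

  def check_amdpal_metadata(metadata):
      """Return a list of problems found in a decoded top-level metadata map."""
      problems = []

      version = metadata.get("amdpal.version")
      if not (isinstance(version, (list, tuple)) and len(version) == 2
              and all(isinstance(v, int) for v in version)):
          problems.append("amdpal.version must be a sequence of 2 integers")

      pipelines = metadata.get("amdpal.pipelines")
      if not isinstance(pipelines, (list, tuple)):
          problems.append("amdpal.pipelines must be a sequence of maps")
          return problems

      for index, pipeline in enumerate(pipelines):
          if not isinstance(pipeline, dict):
              problems.append(f"pipeline {index} is not a map")
              continue
          # Keys marked "Required" in the pipeline metadata map table.
          for key in (".internal_pipeline_hash", ".registers"):
              if key not in pipeline:
                  problems.append(f"pipeline {index} is missing {key}")
      return problems


  # Hypothetical decoded metadata; all values are placeholders.
  example_metadata = {
      "amdpal.version": [2, 6],
      "amdpal.pipelines": [
          {
              ".name": "example_pipeline",
              ".type": "Cs",
              ".internal_pipeline_hash": [0x1234, 0x5678],
              ".registers": {},
          }
      ],
  }

The keys inside the pipeline map use the abbreviated "." prefix because they
appear within a map that was added under the same ``amdpal`` prefix.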
13899
13900 ..
13901
13902   .. table:: AMDPAL Code Object Pipeline Metadata Map
13903      :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
13904
13905      ====================================== ============== ========= ===================================================
13906      String Key                             Value Type     Required? Description
13907      ====================================== ============== ========= ===================================================
13908      ".name"                                string                   Source name of the pipeline.
13909      ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
13910
13911                                                                        - "VsPs"
13912                                                                        - "Gs"
13913                                                                        - "Cs"
13914                                                                        - "Ngg"
13915                                                                        - "Tess"
13916                                                                        - "GsTess"
13917                                                                        - "NggTess"
13918
13919      ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
13920                                             2 integers               64 bits is the "stable" portion of the hash, used
13921                                                                      for e.g. shader replacement lookup. Upper 64 bits
13922                                                                      is the "unique" portion of the hash, used for
13923                                                                      e.g. pipeline cache lookup. The value is
13924                                                                      implementation defined, and cannot be relied on
13925                                                                      between different builds of the compiler.
13926      ".shaders"                             map                      Per-API shader metadata. See
13927                                                                      :ref:`amdgpu-amdpal-code-object-shader-map-table`
13928                                                                      for the definition of the keys included in that
13929                                                                      map.
13930      ".hardware_stages"                     map                      Per-hardware stage metadata. See
13931                                                                      :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
13932                                                                      for the definition of the keys included in that
13933                                                                      map.
13934      ".shader_functions"                    map                      Per-shader function metadata. See
13935                                                                      :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
13936                                                                      for the definition of the keys included in that
13937                                                                      map.
13938      ".registers"                           map            Required  Hardware register configuration. See
13939                                                                      :ref:`amdgpu-amdpal-code-object-register-map-table`
13940                                                                      for the definition of the keys included in that
13941                                                                      map.
13942      ".user_data_limit"                     integer                  Number of user data entries accessed by this
13943                                                                      pipeline.
13944      ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
13945                                                                      NoUserDataSpilling.
13946      ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
13947                                                                      viewport array index feature. Pipelines which use
13948                                                                      this feature can render into all 16 viewports,
13949                                                                      whereas pipelines which do not use it are
13950                                                                      restricted to viewport #0.
13951      ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
13952                                                                      handling data-passing between the ES and GS
13953                                                                      shader stages. This can be zero if the data is
13954                                                                      passed using off-chip buffers. This value should
13955                                                                      be used to program all user-SGPRs which have been
13956                                                                      marked with "UserDataMapping::EsGsLdsSize"
13957                                                                      (typically only the GS and VS HW stages will ever
13958                                                                      have a user-SGPR so marked).
13959      ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
13960                                                                      (maximum number of threads in a subgroup).
13961      ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
13962      ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
13963      ".api"                                 string                   Name of the client graphics API.
13964      ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
13965                                                                      be defined by the driver using the compiler if
13966                                                                      they want to be able to correlate API-specific
13967                                                                      information used during creation at a later time.
13968      ====================================== ============== ========= ===================================================
13969
13970 ..
13971
13972   .. table:: AMDPAL Code Object Shader Map
13973      :name: amdgpu-amdpal-code-object-shader-map-table
13974
13975
13976      +-------------+--------------+-------------------------------------------------------------------+
13977      |String Key   |Value Type    |Description                                                        |
13978      +=============+==============+===================================================================+
13979      |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
13980      |- ".vertex"  |              |for the definition of the keys included in that map.               |
13981      |- ".hull"    |              |                                                                   |
13982      |- ".domain"  |              |                                                                   |
13983      |- ".geometry"|              |                                                                   |
13984      |- ".pixel"   |              |                                                                   |
13985      +-------------+--------------+-------------------------------------------------------------------+
13986
13987 ..
13988
13989   .. table:: AMDPAL Code Object API Shader Metadata Map
13990      :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
13991
13992      ==================== ============== ========= =====================================================================
13993      String Key           Value Type     Required? Description
13994      ==================== ============== ========= =====================================================================
13995      ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
13996                           2 integers               is implementation defined, and cannot be relied on between
13997                                                    different builds of the compiler.
13998      ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
13999                           string                   include:
14000
14001                                                      - ".ls"
14002                                                      - ".hs"
14003                                                      - ".es"
14004                                                      - ".gs"
14005                                                      - ".vs"
14006                                                      - ".ps"
14007                                                      - ".cs"
14008
14009      ==================== ============== ========= =====================================================================
14010
14011 ..
14012
14013   .. table:: AMDPAL Code Object Hardware Stage Map
14014      :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14015
14016      +-------------+--------------+-----------------------------------------------------------------------+
14017      |String Key   |Value Type    |Description                                                            |
14018      +=============+==============+=======================================================================+
14019      |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14020      |- ".hs"      |              |for the definition of the keys included in that map.                   |
14021      |- ".es"      |              |                                                                       |
14022      |- ".gs"      |              |                                                                       |
14023      |- ".vs"      |              |                                                                       |
14024      |- ".ps"      |              |                                                                       |
14025      |- ".cs"      |              |                                                                       |
14026      +-------------+--------------+-----------------------------------------------------------------------+
14027
14028 ..
14029
14030   .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14031      :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14032
14033      ========================== ============== ========= ===============================================================
14034      String Key                 Value Type     Required? Description
14035      ========================== ============== ========= ===============================================================
14036      ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
14037      ".scratch_memory_size"     integer                  Scratch memory size in bytes.
14038      ".lds_size"                integer                  Local Data Share size in bytes.
14039      ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
14040      ".vgpr_count"              integer                  Number of VGPRs used.
14041      ".agpr_count"              integer                  Number of AGPRs used.
14042      ".sgpr_count"              integer                  Number of SGPRs used.
14043      ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
14044                                                          directive to instruct the compiler to limit the VGPR usage to
14045                                                          be less than or equal to the specified value (only set if
14046                                                          different from HW default).
14047      ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
14048                                                          default).
14049      ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
14050                                 3 integers
14051      ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
14052      ".uses_uavs"               boolean                  The shader reads or writes UAVs.
14053      ".uses_rovs"               boolean                  The shader reads or writes ROVs.
14054      ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
14055      ".writes_depth"            boolean                  The shader writes out a depth value.
14056      ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
14057                                                          memory or GDS.
14058      ".uses_prim_id"            boolean                  The shader uses PrimID.
14059      ========================== ============== ========= ===============================================================
14060
14061 ..
14062
14063   .. table:: AMDPAL Code Object Shader Function Map
14064      :name: amdgpu-amdpal-code-object-shader-function-map-table
14065
14066      =============== ============== ====================================================================
14067      String Key      Value Type     Description
14068      =============== ============== ====================================================================
14069      *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
14070                                     entry address. The value is the function's metadata. See
14071                                     :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14072      =============== ============== ====================================================================
14073
14074 ..
14075
14076   .. table:: AMDPAL Code Object Shader Function Metadata Map
14077      :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14078
14079      ============================= ============== =================================================================
14080      String Key                    Value Type     Description
14081      ============================= ============== =================================================================
14082      ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
14083                                    2 integers     is implementation defined, and cannot be relied on between
14084                                                   different builds of the compiler.
14085      ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
14086      ".lds_size"                   integer        Size in bytes of LDS memory.
14087      ".vgpr_count"                 integer        Number of VGPRs used by the shader.
14088      ".sgpr_count"                 integer        Number of SGPRs used by the shader.
14089      ".stack_frame_size_in_bytes"  integer        Stack size in bytes used by the shader.
14090      ".shader_subtype"             string         Shader subtype/kind. Values include:
14091
14092                                                     - "Unknown"
14093
14094      ============================= ============== =================================================================
14095
14096 ..
14097
14098   .. table:: AMDPAL Code Object Register Map
14099      :name: amdgpu-amdpal-code-object-register-map-table
14100
14101      ========================== ============== ====================================================================
14102      32-bit Integer Key         Value Type     Description
14103      ========================== ============== ====================================================================
14104      ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14105                                                a GRBM register (i.e., driver accessible GPU register number, not
14106                                                shader GPR register number). The driver is required to program each
14107                                                specified register to the corresponding specified value when
14108                                                executing this pipeline. Typically, the ``reg offsets`` are the
14109                                                ``uint16_t`` offsets to each register as defined by the hardware
14110                                                chip headers. The register is set to the provided value. However, a
14111                                                ``reg offset`` that specifies a user data register (e.g.,
14112                                                COMPUTE_USER_DATA_0) needs special treatment. See
14113                                                :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14114                                                information.
14115      ========================== ============== ====================================================================
14116
14117 .. _amdgpu-amdpal-code-object-user-data-section:
14118
14119 User Data
14120 +++++++++
14121
14122 Each hardware stage has a set of 32-bit physical SPI *user data registers*
14123 (either 16 or 32 based on graphics IP and the stage) which can be
14124 written from a command buffer and then loaded into SGPRs when waves are
14125 launched via a subsequent dispatch or draw operation. This is the way
14126 most arguments are passed from the application/runtime to a hardware
14127 shader.
14128
14129 PAL abstracts this functionality by exposing a set of 128 *user data
14130 entries* per pipeline a client can use to pass arguments from a command
14131 buffer to one or more shaders in that pipeline. The ELF code object must
14132 specify a mapping from virtualized *user data entries* to physical *user
14133 data registers*, and PAL is responsible for implementing that mapping,
14134 including spilling overflow *user data entries* to memory if needed.
14135
14136 Since the *user data registers* are GRBM-accessible SPI registers, this
14137 mapping is actually embedded in the ``.registers`` metadata entry. For
14138 most registers, the value in that map is a literal 32-bit value that
14139 should be written to the register by the driver. However, when the
14140 register is a *user data register* (any USER_DATA register e.g.,
14141 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14142 the driver to write either a *user data entry* value or one of several
14143 driver-internal values to the register. This encoding is described in
14144 the following table:
14145
14146 .. note::
14147
14148   Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14149   and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14150   always be programmed to the address of the GlobalTable, and *user data
14151   register* 1 must always be programmed to the address of the PerShaderTable.
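
The encoding can be illustrated with a small decoder. The sketch covers only a
subset of the values listed in the AMDPAL User Data Mapping table below, and
the name ``describe_user_data_value`` is hypothetical. It is not part of PAL or
LLVM.

.. code-block:: python

  NUM_USER_DATA_ENTRIES = 128  # PAL exposes user data entries 0..127

  # Subset of the driver-internal values from the user data mapping table.
  SPECIAL_USER_DATA_VALUES = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
  }


  def describe_user_data_value(value):
      """Describe what a user data register's metadata value asks the driver to load."""
      if 0 <= value < NUM_USER_DATA_ENTRIES:
          return f"user_data_entry[{value}]"
      # Values outside the subset above are reported as "unknown" here.
      return SPECIAL_USER_DATA_VALUES.get(value, "unknown")

For example, a ``.registers`` value of ``5`` for a user data register asks the
driver to load user data entry 5, while ``0x10000002`` asks it to load the
address of the spill table.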
14152
14153 ..
14154
14155   .. table:: AMDPAL User Data Mapping
14156      :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14157
14158      ==========  =================  ===============================================================================
14159      Value       Name               Description
14160      ==========  =================  ===============================================================================
14161      0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14162      0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
14163                                     always point to *user data register* 0).
14164      0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
14165                                     :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14166                                     for more detail (should always point to *user data register* 1).
14167      0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
14168                                     :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14169                                     more detail.
14170      0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14171                                     reference the draw index in the vertex shader. Only supported by the first
14172                                     stage in a graphics pipeline.
14173      0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
14174                                     a graphics pipeline.
14175      0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
14176                                     graphics pipeline.
14177      0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14178                                     a buffer containing the grid dimensions for a Compute dispatch operation. The
14179                                     high half of the address is stored in the next sequential user-SGPR. Only
14180                                     supported by compute pipelines.
14181      0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
14182                                     space used for the ES/GS pseudo-ring-buffer for passing data between shader
14183                                     stages.
14184      0x1000000B  ViewId             View ID (32-bit unsigned integer). Identifies a view of graphics
14185                                     pipeline instancing.
14186      0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
14187                                     can only appear for one shader stage per pipeline.
14188      0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
14189      0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
14190                                     only appear for one shader stage per pipeline.
14191      0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
14192                                     only appear for one shader stage per pipeline (PS). These replace color targets
14193                                     and are completely separate from any UAVs used by the shader. This is optional,
14194                                     and only used by the PS when UAV exports are used to replace color-target
14195                                     exports to optimize specific shaders.
14196      0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
14197                                     some NGG pipelines to perform culling.  This value contains the address of the
14198                                     first of two consecutive registers which provide the full GPU address.
14199      0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
14200      ==========  =================  ===============================================================================
14201
14202 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14203
14204 Per-Shader Table
14205 ################
14206
14207 Low 32 bits of the GPU address for an optional buffer in the ``.data``
14208 section of the ELF. The high 32 bits of the address match the high 32 bits
14209 of the shader's program counter.
14210
14211 The buffer can be anything the shader compiler needs it for, and
14212 allows each shader to have its own region of the ``.data`` section.
14213 Typically, this could be a table of buffer SRDs and the data pointed to
14214 by the buffer SRDs, but it could be a flat-address region of memory as
14215 well. Its layout and usage are defined by the shader compiler.
14216
14217 Each shader's table in the ``.data`` section is referenced by the symbol
14218 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``, where *xs* corresponds to the
14219 hardware shader stage the data is for. For example, the symbol is
14220 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
14221
14222 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14223
14224 Spill Table
14225 ###########
14226
14227 It is possible for a hardware shader to need access to more *user data
14228 entries* than there are slots available in user data registers for one
14229 or more hardware shader stages. In that case, the PAL runtime expects
14230 the necessary *user data entries* to be spilled to GPU memory and use
14231 one user data register to point to the spilled user data memory. The
14232 value of the *user data entry* must then represent the location where
14233 a shader expects to read the low 32 bits of the table's GPU virtual
14234 address. The *spill table* itself represents a set of 32-bit values
14235 managed by the PAL runtime in GPU-accessible memory that can be made
14236 indirectly accessible to a hardware shader.
14237
14238 Unspecified OS
14239 --------------
14240
14241 This section provides code conventions used when the target triple OS is
14242 empty (see :ref:`amdgpu-target-triples`).
14243
14244 Trap Handler ABI
14245 ~~~~~~~~~~~~~~~~
14246
14247 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime
14248 does not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
14249 intrinsics are handled as follows:
14250
14251   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14252      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14253
14254      =============== =============== ===========================================
14255      Usage           Code Sequence   Description
14256      =============== =============== ===========================================
14257      llvm.trap       s_endpgm        Causes wavefront to be terminated.
14258      llvm.debugtrap  *none*          Compiler warning given that there is no
14259                                      trap handler installed.
14260      =============== =============== ===========================================
14261
14262 Source Languages
14263 ================
14264
14265 .. _amdgpu-opencl:
14266
14267 OpenCL
14268 ------
14269
14270 When the language is OpenCL the following differences occur:
14271
14272 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14273 2. The AMDGPU backend appends additional arguments to the kernel's explicit
14274    arguments for the AMDHSA OS (see
14275    :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
14276 3. Additional metadata is generated
14277    (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14278
14279   .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14280      :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14281
14282      ======== ==== ========= ===========================================
14283      Position Byte Byte      Description
14284               Size Alignment
14285      ======== ==== ========= ===========================================
14286      1        8    8         OpenCL Global Offset X
14287      2        8    8         OpenCL Global Offset Y
14288      3        8    8         OpenCL Global Offset Z
14289      4        8    8         OpenCL address of printf buffer
14290      5        8    8         OpenCL address of virtual queue used by
14291                              enqueue_kernel.
14292      6        8    8         OpenCL address of AqlWrap struct used by
14293                              enqueue_kernel.
14294      7        8    8         Pointer argument used for Multi-grid
14295                              synchronization.
14296      ======== ==== ========= ===========================================
14297
14298 .. _amdgpu-hcc:
14299
14300 HCC
14301 ---
14302
14303 When the language is HCC the following differences occur:
14304
14305 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14306
14307 .. _amdgpu-assembler:
14308
14309 Assembler
14310 ---------
14311
14312 The AMDGPU backend has an LLVM-MC based assembler, which is currently in
14313 development. It supports AMDGCN GFX6-GFX11.
14314
14315 This section describes general syntax for instructions and operands.
14316
14317 Instructions
14318 ~~~~~~~~~~~~
14319
14320 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14321
14322   | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14323     <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14324
14325 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14326 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14327
14328 The order of operands and modifiers is fixed.
14329 Most modifiers are optional and may be omitted.
14330
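As a sketch of how these rules apply, the following line reuses a MUBUF
instruction from the examples later in this section. The comments are added
here for illustration and split the line into its three parts.

.. code-block:: nasm

  ; opcode:                      buffer_atomic_inc
  ; operands (comma-separated):  v1, v2, s[8:11], s4
  ; modifiers (space-separated): idxen offset:4 slc
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
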
14331 Links to detailed instruction syntax descriptions may be found in the following
14332 table. Note that features under development are not included
14333 in this description.
14334
14335     ============= ============================================= =======================================
14336     Architecture  Core ISA                                      ISA Variants and Extensions
14337     ============= ============================================= =======================================
14338     GCN 2         :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`             \-
14339     GCN 3, GCN 4  :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`             \-
14340     GCN 5         :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14341
14342                                                                 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14343
14344                                                                 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14345
14346                                                                 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14347
14348                                                                 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14349
14350                                                                 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14351
14352     CDNA 1        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14353
14354     CDNA 2        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14355
14356     CDNA 3        :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`             :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14357
14358     RDNA 1        :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>`     :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14359
14360                                                                 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14361
14362                                                                 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14363
14364                                                                 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14365
14366     RDNA 2        :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>`   :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14367
14368                                                                 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14369
14370                                                                 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14371
14372                                                                 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14373
14374                                                                 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14375
14376                                                                 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14377
14378                                                                 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14379
14380     RDNA 3        :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>`           :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
14381
14382                                                                 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
14383
14384                                                                 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
14385
14386                                                                 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
14387     ============= ============================================= =======================================
14388
14389 For more information about instructions, their semantics and supported
14390 combinations of operands, refer to one of the instruction set architecture
14391 manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14392 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14393 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_,
14394 [AMD-GCN-GFX10-RDNA2]_ and [AMD-GCN-GFX11-RDNA3]_.
14395
14396 Operands
14397 ~~~~~~~~
14398
14399 A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14400
14401 Modifiers
14402 ~~~~~~~~~
14403
14404 A detailed description of modifiers may be found
14405 :doc:`here<AMDGPUModifierSyntax>`.
14406
14407 Instruction Examples
14408 ~~~~~~~~~~~~~~~~~~~~
14409
14410 DS
14411 ++
14412
14413 .. code-block:: nasm
14414
14415   ds_add_u32 v2, v4 offset:16
14416   ds_write_src2_b64 v2 offset0:4 offset1:8
14417   ds_cmpst_f32 v2, v4, v6
14418   ds_min_rtn_f64 v[8:9], v2, v[4:5]
14419
14420 For the full list of supported instructions, refer to "LDS/GDS instructions" in
14421 the ISA Manual.
14422
14423 FLAT
14424 ++++
14425
14426 .. code-block:: nasm
14427
14428   flat_load_dword v1, v[3:4]
14429   flat_store_dwordx3 v[3:4], v[5:7]
14430   flat_atomic_swap v1, v[3:4], v5 glc
14431   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14432   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14433
14434 For the full list of supported instructions, refer to "FLAT instructions" in
14435 the ISA Manual.
14436
14437 MUBUF
14438 +++++
14439
14440 .. code-block:: nasm
14441
14442   buffer_load_dword v1, off, s[4:7], s1
14443   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14444   buffer_store_format_xy v[1:2], off, s[4:7], s1
14445   buffer_wbinvl1
14446   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14447
14448 For the full list of supported instructions, refer to "MUBUF Instructions" in
14449 the ISA Manual.
14450
14451 SMRD/SMEM
14452 +++++++++
14453
14454 .. code-block:: nasm
14455
14456   s_load_dword s1, s[2:3], 0xfc
14457   s_load_dwordx8 s[8:15], s[2:3], s4
14458   s_load_dwordx16 s[88:103], s[2:3], s4
14459   s_dcache_inv_vol
14460   s_memtime s[4:5]
14461
14462 For the full list of supported instructions, refer to "Scalar Memory Operations"
14463 in the ISA Manual.
14464
14465 SOP1
14466 ++++
14467
14468 .. code-block:: nasm
14469
14470   s_mov_b32 s1, s2
14471   s_mov_b64 s[0:1], 0x80000000
14472   s_cmov_b32 s1, 200
14473   s_wqm_b64 s[2:3], s[4:5]
14474   s_bcnt0_i32_b64 s1, s[2:3]
14475   s_swappc_b64 s[2:3], s[4:5]
14476   s_cbranch_join s[4:5]
14477
14478 For the full list of supported instructions, refer to "SOP1 Instructions" in
14479 the ISA Manual.
14480
14481 SOP2
14482 ++++
14483
14484 .. code-block:: nasm
14485
14486   s_add_u32 s1, s2, s3
14487   s_and_b64 s[2:3], s[4:5], s[6:7]
14488   s_cselect_b32 s1, s2, s3
14489   s_andn2_b32 s2, s4, s6
14490   s_lshr_b64 s[2:3], s[4:5], s6
14491   s_ashr_i32 s2, s4, s6
14492   s_bfm_b64 s[2:3], s4, s6
14493   s_bfe_i64 s[2:3], s[4:5], s6
14494   s_cbranch_g_fork s[4:5], s[6:7]
14495
14496 For the full list of supported instructions, refer to "SOP2 Instructions" in
14497 the ISA Manual.
14498
14499 SOPC
14500 ++++
14501
14502 .. code-block:: nasm
14503
14504   s_cmp_eq_i32 s1, s2
14505   s_bitcmp1_b32 s1, s2
14506   s_bitcmp0_b64 s[2:3], s4
14507   s_setvskip s3, s5
14508
14509 For the full list of supported instructions, refer to "SOPC Instructions" in
14510 the ISA Manual.
14511
14512 SOPP
14513 ++++
14514
14515 .. code-block:: nasm
14516
14517   s_barrier
14518   s_nop 2
14519   s_endpgm
14520   s_waitcnt 0 ; Wait for all counters to be 0
14521   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
14522   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
14523   s_sethalt 9
14524   s_sleep 10
14525   s_sendmsg 0x1
14526   s_sendmsg sendmsg(MSG_INTERRUPT)
14527   s_trap 1
14528
14529 For the full list of supported instructions, refer to "SOPP Instructions" in
14530 the ISA Manual.
14531
14532 Unless otherwise mentioned, little verification is performed on the operands
14533 of SOPP Instructions, so it is up to the programmer to be familiar with the
14534 range of acceptable values.
14535
14536 VALU
14537 ++++
14538
14539 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
14540 the assembler automatically uses the optimal encoding based on the operands. To
14541 force a specific encoding, add a suffix to the opcode of the instruction:
14542
14543 * ``_e32`` for 32-bit VOP1/VOP2/VOPC
14544 * ``_e64`` for 64-bit VOP3
14545 * ``_dpp`` for VOP_DPP
14546 * ``_sdwa`` for VOP_SDWA
14547
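A minimal sketch of the suffixes in use, assuming a target that supports all
four encodings (for example GFX9). The register numbers are arbitrary, and the
comments are illustrative annotations.

.. code-block:: nasm

  v_add_f32 v1, v2, v3        ; no suffix: the assembler chooses the encoding
  v_add_f32_e32 v1, v2, v3    ; force the 32-bit VOP2 encoding
  v_add_f32_e64 v1, v2, v3    ; force the 64-bit VOP3 encoding
  v_mov_b32_dpp v0, v0 quad_perm:[0,2,1,1]
  v_mov_b32_sdwa v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
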
14548 VOP1/VOP2/VOP3/VOPC examples:
14549
14550 .. code-block:: nasm
14551
14552   v_mov_b32 v1, v2
14553   v_mov_b32_e32 v1, v2
14554   v_nop
14555   v_cvt_f64_i32_e32 v[1:2], v2
14556   v_floor_f32_e32 v1, v2
14557   v_bfrev_b32_e32 v1, v2
14558   v_add_f32_e32 v1, v2, v3
14559   v_mul_i32_i24_e64 v1, v2, 3
14560   v_mul_i32_i24_e32 v1, -3, v3
14561   v_mul_i32_i24_e32 v1, -100, v3
14562   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
14563   v_max_f16_e32 v1, v2, v3
14564
14565 VOP_DPP examples:
14566
14567 .. code-block:: nasm
14568
14569   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
14570   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14571   v_mov_b32 v0, v0 wave_shl:1
14572   v_mov_b32 v0, v0 row_mirror
14573   v_mov_b32 v0, v0 row_bcast:31
14574   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
14575   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14576   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14577
14578 VOP_SDWA examples:
14579
14580 .. code-block:: nasm
14581
14582   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14583   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14584   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14585   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14586   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14587
14588 For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual.
14589
14590 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14591
14592 Code Object V2 Predefined Symbols
14593 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14594
14595 .. warning::
14596   Code object V2 is not the default code object version emitted by
14597   this version of LLVM.
14598
14599 The AMDGPU assembler defines and updates some symbols automatically. These
14600 symbols do not affect code generation.
14601
14602 .option.machine_version_major
14603 +++++++++++++++++++++++++++++
14604
14605 Set to the GFX major generation number of the target being assembled for. For
14606 example, when assembling for a "GFX9" target this will be set to the integer
14607 value "9". The possible GFX major generation numbers are presented in
14608 :ref:`amdgpu-processors`.
14609
14610 .option.machine_version_minor
14611 +++++++++++++++++++++++++++++
14612
14613 Set to the GFX minor generation number of the target being assembled for. For
14614 example, when assembling for a "GFX810" target this will be set to the integer
14615 value "1". The possible GFX minor generation numbers are presented in
14616 :ref:`amdgpu-processors`.
14617
14618 .option.machine_version_stepping
14619 ++++++++++++++++++++++++++++++++
14620
14621 Set to the GFX stepping generation number of the target being assembled for.
14622 For example, when assembling for a "GFX704" target this will be set to the
14623 integer value "4". The possible GFX stepping generation numbers are presented
14624 in :ref:`amdgpu-processors`.
14625
14626 .kernel.vgpr_count
14627 ++++++++++++++++++
14628
14629 Set to zero each time a
14630 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14631 encountered. At each instruction, if the current value of this symbol is less
14632 than or equal to the maximum VGPR number explicitly referenced within that
14633 instruction then the symbol value is updated to equal that VGPR number plus
14634 one.
14635
14636 .kernel.sgpr_count
14637 ++++++++++++++++++
14638
14639 Set to zero each time a
14640 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14641 encountered. At each instruction, if the current value of this symbol is less
14642 than or equal to the maximum SGPR number explicitly referenced within that
14643 instruction then the symbol value is updated to equal that SGPR number plus
14644 one.
14645
14646 .. _amdgpu-amdhsa-assembler-directives-v2:
14647
14648 Code Object V2 Directives
14649 ~~~~~~~~~~~~~~~~~~~~~~~~~
14650
14651 .. warning::
14652   Code object V2 is not the default code object version emitted by
14653   this version of LLVM.
14654
14655 The AMDGPU ABI defines auxiliary data in the output code object. In assembly
14656 source, this data can be specified with assembler directives.
14657
14658 .hsa_code_object_version major, minor
14659 +++++++++++++++++++++++++++++++++++++
14660
14661 *major* and *minor* are integers that specify the version of the HSA code
14662 object that will be generated by the assembler.
14663
14664 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
14665 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14666
14667
14668 *major*, *minor*, and *stepping* are all integers that describe the instruction
14669 set architecture (ISA) version of the assembly program.
14670
14671 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
14672 "AMD" and *arch* should always be equal to "AMDGPU".
14673
14674 By default, the assembler will derive the ISA version, *vendor*, and *arch*
14675 from the value of the -mcpu option that is passed to the assembler.
14676
14677 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
14678
14679 .amdgpu_hsa_kernel (name)
14680 +++++++++++++++++++++++++
14681
14682 This directive specifies that the symbol with the given name is a kernel entry
14683 point (label) and that the object should contain a corresponding symbol of type
14684 STT_AMDGPU_HSA_KERNEL.
14685
14686 .amd_kernel_code_t
14687 ++++++++++++++++++
14688
14689 This directive marks the beginning of a list of key / value pairs that are used
14690 to specify the amd_kernel_code_t object that will be emitted by the assembler.
14691 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
14692 amd_kernel_code_t values that are unspecified a default value will be used. The
14693 default value for all keys is 0, with the following exceptions:
14694
14695 - *amd_code_version_major* defaults to 1.
14696 - *amd_kernel_code_version_minor* defaults to 2.
14697 - *amd_machine_kind* defaults to 1.
14698 - *amd_machine_version_major*, *amd_machine_version_minor*, and
14699   *amd_machine_version_stepping* are derived from the value of the -mcpu option
14700   that is passed to the assembler.
14701 - *kernel_code_entry_byte_offset* defaults to 256.
14702 - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
14703   it defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
14704   Note that wavefront size is specified as a power of two, so a value of **n**
14705   means a size of 2^ **n**.
14706 - *call_convention* defaults to -1.
14707 - *kernarg_segment_alignment*, *group_segment_alignment*, and
14708   *private_segment_alignment* default to 4. Note that alignments are specified
14709   as a power of 2, so a value of **n** means an alignment of 2^ **n**.
14710 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
14711   GFX90A onwards.
14712 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
14713   GFX10 onwards.
14714 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
14715
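As an illustration of the power-of-two encoding described above, the fragment
below sets a few of these keys explicitly. It is a sketch only: the values are
chosen for illustration, and the block is assumed to follow a kernel label in a
complete source file.

.. code-block:: nasm

  .amd_kernel_code_t
     wavefront_size = 6               ; 2^6 = 64 work-items per wavefront
     kernarg_segment_alignment = 4    ; 2^4 = 16-byte alignment (the default)
     group_segment_alignment = 8      ; 2^8 = 256-byte alignment
  .end_amd_kernel_code_t
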
14716 The *.amd_kernel_code_t* directive must be placed immediately after the
14717 function label and before any instructions.
14718
14719 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
14720 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
14721
14722 .. _amdgpu-amdhsa-assembler-example-v2:
14723
14724 Code Object V2 Example Source Code
14725 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14726
14727 .. warning::
14728   Code Object V2 is not the default code object version emitted by
14729   this version of LLVM.
14730
14731 Here is an example of a minimal assembly source file, defining one HSA kernel:
14732
14733 .. code::
14734    :number-lines:
14735
14736    .hsa_code_object_version 1,0
14737    .hsa_code_object_isa
14738
14739    .hsatext
14740    .globl  hello_world
14741    .p2align 8
14742    .amdgpu_hsa_kernel hello_world
14743
14744    hello_world:
14745
14746       .amd_kernel_code_t
14747          enable_sgpr_kernarg_segment_ptr = 1
14748          is_ptr64 = 1
14749          compute_pgm_rsrc1_vgprs = 0
14750          compute_pgm_rsrc1_sgprs = 0
14751          compute_pgm_rsrc2_user_sgpr = 2
14752          compute_pgm_rsrc1_wgp_mode = 0
14753          compute_pgm_rsrc1_mem_ordered = 0
14754          compute_pgm_rsrc1_fwd_progress = 1
14755      .end_amd_kernel_code_t
14756
14757      s_load_dwordx2 s[0:1], s[0:1], 0x0
14758      v_mov_b32 v0, 3.14159
14759      s_waitcnt lgkmcnt(0)
14760      v_mov_b32 v1, s0
14761      v_mov_b32 v2, s1
14762      flat_store_dword v[1:2], v0
14763      s_endpgm
14764    .Lfunc_end0:
14765         .size   hello_world, .Lfunc_end0-hello_world
14766
14767 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
14768
14769 Code Object V3 and Above Predefined Symbols
14770 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14771
14772 The AMDGPU assembler defines and updates some symbols automatically. These
14773 symbols do not affect code generation.
14774
14775 .amdgcn.gfx_generation_number
14776 +++++++++++++++++++++++++++++
14777
14778 Set to the GFX major generation number of the target being assembled for. For
14779 example, when assembling for a "GFX9" target this will be set to the integer
14780 value "9". The possible GFX major generation numbers are presented in
14781 :ref:`amdgpu-processors`.
14782
14783 .amdgcn.gfx_generation_minor
14784 ++++++++++++++++++++++++++++
14785
14786 Set to the GFX minor generation number of the target being assembled for. For
14787 example, when assembling for a "GFX810" target this will be set to the integer
14788 value "1". The possible GFX minor generation numbers are presented in
14789 :ref:`amdgpu-processors`.
14790
14791 .amdgcn.gfx_generation_stepping
14792 +++++++++++++++++++++++++++++++
14793
14794 Set to the GFX stepping generation number of the target being assembled for.
14795 For example, when assembling for a "GFX704" target this will be set to the
14796 integer value "4". The possible GFX stepping generation numbers are presented
14797 in :ref:`amdgpu-processors`.
14798
14799 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
14800
14801 .amdgcn.next_free_vgpr
14802 ++++++++++++++++++++++
14803
14804 Set to zero before assembly begins. At each instruction, if the current value
14805 of this symbol is less than or equal to the maximum VGPR number explicitly
14806 referenced within that instruction then the symbol value is updated to equal
14807 that VGPR number plus one.
14808
14809 May be used to set the ``.amdhsa_next_free_vgpr`` directive in
14810 :ref:`amdhsa-kernel-directives-table`.
14811
14812 May be set at any time, e.g. manually set to zero at the start of each kernel.
14813
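The update rule can be traced on a short instruction sequence. The following is
a sketch with arbitrary registers and values. The comments show the symbol
value that the rule above implies after each line.

.. code-block:: nasm

  .set .amdgcn.next_free_vgpr, 0   ; manual reset at the start of a kernel
  v_mov_b32 v0, 0                  ; highest VGPR is v0, 0 <= 0: symbol becomes 1
  v_add_f32 v5, v0, v0             ; highest VGPR is v5, 1 <= 5: symbol becomes 6
  v_mov_b32 v1, v5                 ; highest VGPR is v5, 6 > 5: symbol stays 6
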
14814 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
14815
14816 .amdgcn.next_free_sgpr
14817 ++++++++++++++++++++++
14818
14819 Set to zero before assembly begins. At each instruction, if the current value
14820 of this symbol is less than or equal to the maximum SGPR number explicitly
14821 referenced within that instruction then the symbol value is updated to equal
14822 that SGPR number plus one.
14823
14824 May be used to set the ``.amdhsa_next_free_sgpr`` directive in
14825 :ref:`amdhsa-kernel-directives-table`.
14826
14827 May be set at any time, e.g. manually set to zero at the start of each kernel.
14828
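The same rule applies to register ranges, where the highest register in the
range counts. The following is a sketch with arbitrary registers. The comments
show the symbol value that the rule above implies after each line.

.. code-block:: nasm

  .set .amdgcn.next_free_sgpr, 0        ; manual reset at the start of a kernel
  s_mov_b32 s3, 0                       ; highest SGPR is s3: symbol becomes 4
  s_load_dwordx2 s[8:9], s[0:1], 0x0    ; highest SGPR is s9: symbol becomes 10
  s_add_u32 s2, s3, s8                  ; highest SGPR is s8, 10 > 8: symbol stays 10
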
14829 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
14830
14831 Code Object V3 and Above Directives
14832 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14833
14834 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
14835 architecture processors, and are not OS-specific. Directives which begin with
14836 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
14837 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
14838 :ref:`amdgpu-processors`.
14839
14840 .. _amdgpu-assembler-directive-amdgcn-target:
14841
14842 .amdgcn_target <target-triple> "-" <target-id>
14843 ++++++++++++++++++++++++++++++++++++++++++++++
14844
14845 Optional directive which declares the ``<target-triple>-<target-id>`` supported
14846 by the containing assembler source file. Used by the assembler to validate
14847 command-line options such as ``-triple``, ``-mcpu``, and
14848 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
14849 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
14850
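For example, a source file intended for ``gfx900`` on the ``amdhsa`` OS might
begin with the line below. This is a sketch that assumes the assembler is
invoked with a matching triple (``amdgcn-amd-amdhsa``) and ``-mcpu=gfx900``, and
that no target features are specified in the target ID. The doubled hyphen
appears because the environment component of the triple is empty.

.. code-block:: nasm

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900"
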
14851 .. note::
14852
14853   The target ID syntax used for code object V2 to V3 for this directive differs
14854   from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
14855
14856 .amdhsa_kernel <name>
14857 +++++++++++++++++++++
14858
14859 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
14860 ``<name>.kd``, in the current location of the current section. Only valid when
14861 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
14862 instruction to execute, and does not need to be previously defined.
14863
14864 Marks the beginning of a list of directives used to generate the bytes of a
14865 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
14866 Directives which may appear in this list are described in
14867 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
14868 be valid for the target being assembled for, and cannot be repeated. Directives
14869 support the range of values specified by the field they reference in
14870 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
14871 assumed to have its default value, unless it is marked as "Required", in which
14872 case it is an error to omit the directive. This list of directives is
14873 terminated by an ``.end_amdhsa_kernel`` directive.
14874
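A minimal sketch of such a list is shown below. It assumes a hypothetical
kernel label ``my_kernel`` defined elsewhere in the file, and assumes that the
two register-count directives shown are the only ones the target requires. All
other directives take their default values.

.. code-block:: nasm

  .amdhsa_kernel my_kernel
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

Using the predefined symbols here avoids hard-coding register counts, at the
cost of relying on the symbols being reset at the start of each kernel.
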
14875   .. table:: AMDHSA Kernel Assembler Directives
14876      :name: amdhsa-kernel-directives-table
14877
14878      ======================================================== =================== ============ ===================
14879      Directive                                                Default             Supported On Description
14880      ======================================================== =================== ============ ===================
14881      ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX11   Controls GROUP_SEGMENT_FIXED_SIZE in
14882                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14883      ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX11   Controls PRIVATE_SEGMENT_FIXED_SIZE in
14884                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14885      ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX11   Controls KERNARG_SIZE in
14886                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14887      ``.amdhsa_user_sgpr_count``                              0                   GFX6-GFX11   Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
14888                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`
14889      ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
14890                                                                                   (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14891                                                                                   GFX940)
14892      ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_PTR in
14893                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14894      ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX11   Controls ENABLE_SGPR_QUEUE_PTR in
14895                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14896      ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX11   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
14897                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14898      ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX11   Controls ENABLE_SGPR_DISPATCH_ID in
14899                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14900      ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
14901                                                                                   (except      :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14902                                                                                   GFX940)
14903      ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX11   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
14904                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14905      ``.amdhsa_wavefront_size32``                             Target              GFX10-GFX11  Controls ENABLE_WAVEFRONT_SIZE32 in
14906                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14907                                                               Specific
14908                                                               (wavefrontsize64)
14909      ``.amdhsa_uses_dynamic_stack``                           0                   GFX6-GFX11   Controls USES_DYNAMIC_STACK in
14910                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14911      ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
14912                                                                                   (except      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14913                                                                                   GFX940)
14914      ``.amdhsa_enable_private_segment``                       0                   GFX940,      Controls ENABLE_PRIVATE_SEGMENT in
14915                                                                                   GFX11        :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14916      ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_X in
14917                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14918      ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
14919                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14920      ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
14921                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14922      ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX11   Controls ENABLE_SGPR_WORKGROUP_INFO in
14923                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14924      ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX11   Controls ENABLE_VGPR_WORKITEM_ID in
14925                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14926                                                                                                Possible values are defined in
14927                                                                                                :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
14928      ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX11   Maximum VGPR number explicitly referenced, plus one.
14929                                                                                                Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
14930                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14931      ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX11   Maximum SGPR number explicitly referenced, plus one.
14932                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14933                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14934      ``.amdhsa_accum_offset``                                 Required            GFX90A,      Offset of the first AccVGPR in the unified register file.
14935                                                                                   GFX940       Used to calculate ACCUM_OFFSET in
14936                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14937      ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX11   Whether the kernel may use the special VCC SGPR.
14938                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14939                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14940      ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
14941                                                                                   (except      scratch memory. Used to calculate
14942                                                                                   GFX940)      GRANULATED_WAVEFRONT_SGPR_COUNT in
14943                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14944      ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
14945                                                               Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14946                                                               Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14947                                                               (xnack)
14948      ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_32 in
14949                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14950                                                                                                Possible values are defined in
14951                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14952      ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX11   Controls FLOAT_ROUND_MODE_16_64 in
14953                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14954                                                                                                Possible values are defined in
14955                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14956      ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_32 in
14957                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14958                                                                                                Possible values are defined in
14959                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14960      ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX11   Controls FLOAT_DENORM_MODE_16_64 in
14961                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14962                                                                                                Possible values are defined in
14963                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14964      ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX11   Controls ENABLE_DX10_CLAMP in
14965                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14966      ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX11   Controls ENABLE_IEEE_MODE in
14967                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14968      ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX11   Controls FP16_OVFL in
14969                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14970      ``.amdhsa_tg_split``                                     Target              GFX90A,      Controls TG_SPLIT in
14971                                                               Feature             GFX940,      :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14972                                                               Specific            GFX11
14973                                                               (tgsplit)
14974      ``.amdhsa_workgroup_processor_mode``                     Target              GFX10-GFX11  Controls ENABLE_WGP_MODE in
14975                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14976                                                               Specific
14977                                                               (cumode)
14978      ``.amdhsa_memory_ordered``                               1                   GFX10-GFX11  Controls MEM_ORDERED in
14979                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14980      ``.amdhsa_forward_progress``                             0                   GFX10-GFX11  Controls FWD_PROGRESS in
14981                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14982      ``.amdhsa_shared_vgpr_count``                            0                   GFX10-GFX11  Controls SHARED_VGPR_COUNT in
14983                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
14984      ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
14985                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14986      ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
14987                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14988      ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
14989                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14990      ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
14991                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14992      ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
14993                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14994      ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
14995                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14996      ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX11   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
14997                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14998      ======================================================== =================== ============ ===================
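
As a sketch of the default rule described above, a descriptor for a hypothetical
kernel ``my_kernel`` on a GFX9 target such as ``gfx900`` only needs to spell out
the two required directives plus any directive whose value should differ from
its default. The kernel name and the literal values below are assumptions chosen
for illustration:

.. code::

   .amdhsa_kernel my_kernel
     .amdhsa_group_segment_fixed_size 1024   // overrides the default of 0
     .amdhsa_user_sgpr_kernarg_segment_ptr 1 // overrides the default of 0
     .amdhsa_system_sgpr_workgroup_id_y 1    // overrides the default of 0
     .amdhsa_next_free_vgpr 8                // required, no default
     .amdhsa_next_free_sgpr 16               // required, no default
   .end_amdhsa_kernel

Every directive omitted here keeps the default listed in the table, so, for
example, ``.amdhsa_ieee_mode`` and ``.amdhsa_system_sgpr_workgroup_id_x`` both
remain 1.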
14999
15000 .amdgpu_metadata
15001 ++++++++++++++++
15002
15003 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15004 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15005
15006 The contents must be in the [YAML]_ markup format, with the same structure and
15007 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15008 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15009 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15010
15011 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
15012
15013 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15014
15015 Code Object V3 and Above Example Source Code
15016 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15017
15018 Here is an example of a minimal assembly source file, defining one HSA kernel:
15019
15020 .. code::
15021    :number-lines:
15022
15023    .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15024
15025    .text
15026    .globl hello_world
15027    .p2align 8
15028    .type hello_world,@function
15029    hello_world:
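15030      s_load_dwordx2 s[0:1], s[0:1], 0x0 // load the pointer argument from the kernarg segment
15031      v_mov_b32 v0, 3.14159              // value to store
15032      s_waitcnt lgkmcnt(0)               // wait for the scalar load to complete
15033      v_mov_b32 v1, s0                   // copy the low half of the address
15034      v_mov_b32 v2, s1                   // copy the high half of the address
15035      flat_store_dword v[1:2], v0        // *p = 3.14159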
15036      s_endpgm
15037    .Lfunc_end0:
15038      .size   hello_world, .Lfunc_end0-hello_world
15039
15040    .rodata
15041    .p2align 6
15042    .amdhsa_kernel hello_world
15043      .amdhsa_user_sgpr_kernarg_segment_ptr 1
15044      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15045      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15046    .end_amdhsa_kernel
15047
15048    .amdgpu_metadata
15049    ---
15050    amdhsa.version:
15051      - 1
15052      - 0
15053    amdhsa.kernels:
15054      - .name: hello_world
15055        .symbol: hello_world.kd
15056        .kernarg_segment_size: 48
15057        .group_segment_fixed_size: 0
15058        .private_segment_fixed_size: 0
15059        .kernarg_segment_align: 4
15060        .wavefront_size: 64
15061        .sgpr_count: 2
15062        .vgpr_count: 3
15063        .max_flat_workgroup_size: 256
15064        .args:
15065          - .size: 8
15066            .offset: 0
15067            .value_kind: global_buffer
15068            .address_space: global
15069            .actual_access: write_only
15070    //...
15071    .end_amdgpu_metadata
15072
15073 This kernel is equivalent to the following HIP program:
15074
15075 .. code::
15076    :number-lines:
15077
15078    __global__ void hello_world(float *p) {
15079        *p = 3.14159f;
15080    }
15081
15082 If an assembly source file contains multiple kernels and/or functions, the
15083 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15084 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15085 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
15086 kernels, where ``func1`` is only called from ``kern1``, it is sufficient
15087 to group the function with the kernel that calls it and reset the symbols
15088 between the two connected components:
15089
15090 .. code::
15091    :number-lines:
15092
15093    .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15094
15095    // GPR tracking symbols are implicitly set to zero
15096
15097    .text
15098    .globl kern0
15099    .p2align 8
15100    .type kern0,@function
15101    kern0:
15102      // ...
15103      s_endpgm
15104    .Lkern0_end:
15105      .size   kern0, .Lkern0_end-kern0
15106
15107    .rodata
15108    .p2align 6
15109    .amdhsa_kernel kern0
15110      // ...
15111      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15112      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15113    .end_amdhsa_kernel
15114
15115    // reset symbols to begin tracking usage in func1 and kern1
15116    .set .amdgcn.next_free_vgpr, 0
15117    .set .amdgcn.next_free_sgpr, 0
15118
15119    .text
15120    .hidden func1
15121    .globl func1
15122    .p2align 2
15123    .type func1,@function
15124    func1:
15125      // ...
15126      s_setpc_b64 s[30:31]
15127    .Lfunc1_end:
15128      .size   func1, .Lfunc1_end-func1
15129
15130    .globl kern1
15131    .p2align 8
15132    .type kern1,@function
15133    kern1:
15134      // ...
15135      s_getpc_b64 s[4:5]
15136      s_add_u32 s4, s4, func1@rel32@lo+4
15137      s_addc_u32 s5, s5, func1@rel32@hi+12
15138      s_swappc_b64 s[30:31], s[4:5]
15139      // ...
15140      s_endpgm
15141    .Lkern1_end:
15142      .size   kern1, .Lkern1_end-kern1
15143
15144    .rodata
15145    .p2align 6
15146    .amdhsa_kernel kern1
15147      // ...
15148      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15149      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15150    .end_amdhsa_kernel
15151
15152 The assembler cannot identify connected components, so these symbols do not
15153 automatically track the usage of each kernel. However, in some cases careful
15154 organization of the kernels and functions in the source file keeps the additional
15155 effort required to accurately calculate GPR usage to a minimum.
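
When the source file cannot be arranged this way, for instance because one
function is called from several kernels, a fallback is to work out the register
counts by hand and pass them as literal values instead of the tracking symbols.
In the sketch below the kernel name ``kern2`` and both counts are hypothetical;
they assume that the highest registers referenced by the kernel or by anything
it calls are ``v23`` and ``s39``:

.. code::

   .rodata
   .p2align 6
   .amdhsa_kernel kern2
     // ...
     .amdhsa_next_free_vgpr 24 // v23 + 1, worked out by hand
     .amdhsa_next_free_sgpr 40 // s39 + 1, worked out by hand
   .end_amdhsa_kernel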
15156
15157 Additional Documentation
15158 ========================
15159
15160 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
15161 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15162 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15163 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15164 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15165 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15166 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15167 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15168 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15169 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
15170 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15171 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15172 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15173 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15174 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
15175 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
15176 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15177 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15178 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15179 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15180 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15181 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15182 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15183 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15184 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15185 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__