llvm/docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 .. toctree::
   9    :hidden:
  10
  11    AMDGPU/AMDGPUAsmGFX7
  12    AMDGPU/AMDGPUAsmGFX8
  13    AMDGPU/AMDGPUAsmGFX9
  14    AMDGPU/AMDGPUAsmGFX900
  15    AMDGPU/AMDGPUAsmGFX904
  16    AMDGPU/AMDGPUAsmGFX906
  17    AMDGPU/AMDGPUAsmGFX908
  18    AMDGPU/AMDGPUAsmGFX90a
  19    AMDGPU/AMDGPUAsmGFX10
  20    AMDGPU/AMDGPUAsmGFX1011
  21    AMDGPUModifierSyntax
  22    AMDGPUOperandSyntax
  23    AMDGPUInstructionSyntax
  24    AMDGPUInstructionNotation
  25    AMDGPUDwarfExtensionsForHeterogeneousDebugging
  26    AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
  27
  28 Introduction
  29 ============
  30
  31 The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
  32 family through the current GCN families. It lives in the
  33 ``llvm/lib/Target/AMDGPU`` directory.
  34
  35 LLVM
  36 ====
  37
  38 .. _amdgpu-target-triples:
  39
  40 Target Triples
  41 --------------
  42
  43 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
  44 to specify the target triple:
  45
  46   .. table:: AMDGPU Architectures
  47      :name: amdgpu-architecture-table
  48
  49      ============ ==============================================================
  50      Architecture Description
  51      ============ ==============================================================
  52      ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
  53      ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
  54      ============ ==============================================================
  55
  56   .. table:: AMDGPU Vendors
  57      :name: amdgpu-vendor-table
  58
  59      ============ ==============================================================
  60      Vendor       Description
  61      ============ ==============================================================
  62      ``amd``      Can be used for all AMD GPU usage.
  63      ``mesa3d``   Can be used if the OS is ``mesa3d``.
  64      ============ ==============================================================
  65
  66   .. table:: AMDGPU Operating Systems
  67      :name: amdgpu-os
  68
  69      ============== ============================================================
  70      OS             Description
  71      ============== ============================================================
  72      *<empty>*      Defaults to the *unknown* OS.
  73      ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
  74                     such as:
  75
  76                     - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
  77                       loader on Linux. See *AMD ROCm Platform Release Notes*
  78                       [AMD-ROCm-Release-Notes]_ for supported hardware and
  79                       software.
  80                     - AMD's PAL runtime using the *pal-amdhsa* loader on
  81                       Windows.
  82
  83      ``amdpal``     Graphics shaders and compute kernels executed on AMD's PAL
  84                     runtime using the *pal-amdpal* loader on Windows and Linux
  85                     Pro.
  86      ``mesa3d``     Graphics shaders and compute kernels executed on AMD's Mesa
  87                     3D runtime using the *mesa-mesa3d* loader on Linux.
  88      ============== ============================================================
  89
  90   .. table:: AMDGPU Environments
  91      :name: amdgpu-environment-table
  92
  93      ============ ==============================================================
  94      Environment  Description
  95      ============ ==============================================================
  96      *<empty>*    Default.
  97      ============ ==============================================================
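
The tables above can be combined into a triple string mechanically. The
following Python sketch is illustrative only and is not part of LLVM or Clang:
``make_triple`` and the three sets are hypothetical names, the sets simply
mirror the table rows, and omitting an empty OS component from the string is an
assumption of this sketch rather than documented behavior.

.. code-block:: python

   # Values mirror the Architecture, Vendor, and OS tables; "" stands for *<empty>*.
   ARCHITECTURES = {"r600", "amdgcn"}
   VENDORS = {"amd", "mesa3d"}
   OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}


   def make_triple(arch, vendor="amd", os_name=""):
       """Return <Architecture>-<Vendor>-<OS>, validating each component."""
       if arch not in ARCHITECTURES:
           raise ValueError(f"unknown architecture: {arch!r}")
       if vendor not in VENDORS:
           raise ValueError(f"unknown vendor: {vendor!r}")
       if os_name not in OPERATING_SYSTEMS:
           raise ValueError(f"unknown OS: {os_name!r}")
       # The vendor table allows "mesa3d" only together with the "mesa3d" OS.
       if vendor == "mesa3d" and os_name != "mesa3d":
           raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
       # The only listed environment is empty, so it is never appended; an
       # empty OS is left out as well (an assumption of this sketch).
       parts = [arch, vendor] + ([os_name] if os_name else [])
       return "-".join(parts)


   print(make_triple("amdgcn", os_name="amdhsa"))    # amdgcn-amd-amdhsa
   print(make_triple("amdgcn", "mesa3d", "mesa3d"))  # amdgcn-mesa3d-mesa3d

The returned string is the value that would follow the ``-target`` option.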
  98
  99 .. _amdgpu-processors:
 100
 101 Processors
 102 ----------
 103
 104 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
 105 specify the AMDGPU processor together with optional target features. See
 106 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
 107 target-specific information.
 108
 109 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exception:
 110
 111 * ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
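
As a rough illustration of how the processor, its triple architecture, and the
``amdhsa`` exception fit together, the Python sketch below assembles the Clang
arguments for a processor. It is an assumption-laden example and not an LLVM
API: ``PROCESSOR_ARCH`` and ``clang_target_args`` are hypothetical names, the
mapping holds only a handful of rows from the processor table that follows, and
the processor is passed as a bare name without any target features.

.. code-block:: python

   # Excerpt of the AMDGPU Processors table: processor -> triple architecture.
   PROCESSOR_ARCH = {
       "rv770": "r600",
       "cayman": "r600",
       "gfx900": "amdgcn",
       "gfx906": "amdgcn",
       "gfx1030": "amdgcn",
   }

   # Non-empty OS values from the AMDGPU Operating Systems table.
   KNOWN_OS = {"amdhsa", "amdpal", "mesa3d"}


   def clang_target_args(processor, os_name, vendor="amd"):
       """Build the -target and -mcpu arguments for one processor."""
       arch = PROCESSOR_ARCH[processor]  # KeyError for rows not in the excerpt
       if os_name not in KNOWN_OS:
           raise ValueError(f"unknown OS: {os_name!r}")
       # The one documented exception: amdhsa is not supported on r600.
       if arch == "r600" and os_name == "amdhsa":
           raise ValueError("amdhsa is not supported on the r600 architecture")
       return ["-target", f"{arch}-{vendor}-{os_name}", f"-mcpu={processor}"]


   print(clang_target_args("gfx906", "amdhsa"))
   # ['-target', 'amdgcn-amd-amdhsa', '-mcpu=gfx906']

The returned list is meant to be appended to a ``clang`` command line; whether a
given runtime actually supports the processor is a separate question answered
by the OS Support column and the runtime release notes.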
 112
 113
 114   .. table:: AMDGPU Processors
 115      :name: amdgpu-processor-table
 116
 117      =========== =============== ============ ===== ================= =============== =============== ======================
 118      Processor   Alternative     Target       dGPU/ Target            Target          OS Support      Example
 119                  Processor       Triple       APU   Features          Properties      *(see*          Products
 120                                  Architecture       Supported                         `amdgpu-os`_
 121                                                                                       *and
 122                                                                                       corresponding
 123                                                                                       runtime release
 124                                                                                       notes for
 125                                                                                       current
 126                                                                                       information and
 127                                                                                       level of
 128                                                                                       support)*
 129      =========== =============== ============ ===== ================= =============== =============== ======================
 130      **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
 131      -----------------------------------------------------------------------------------------------------------------------
 132      ``r600``                    ``r600``     dGPU                    - Does not
 133                                                                         support
 134                                                                         generic
 135                                                                         address
 136                                                                         space
 137      ``r630``                    ``r600``     dGPU                    - Does not
 138                                                                         support
 139                                                                         generic
 140                                                                         address
 141                                                                         space
 142      ``rs880``                   ``r600``     dGPU                    - Does not
 143                                                                         support
 144                                                                         generic
 145                                                                         address
 146                                                                         space
 147      ``rv670``                   ``r600``     dGPU                    - Does not
 148                                                                         support
 149                                                                         generic
 150                                                                         address
 151                                                                         space
 152      **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
 153      -----------------------------------------------------------------------------------------------------------------------
 154      ``rv710``                   ``r600``     dGPU                    - Does not
 155                                                                         support
 156                                                                         generic
 157                                                                         address
 158                                                                         space
 159      ``rv730``                   ``r600``     dGPU                    - Does not
 160                                                                         support
 161                                                                         generic
 162                                                                         address
 163                                                                         space
 164      ``rv770``                   ``r600``     dGPU                    - Does not
 165                                                                         support
 166                                                                         generic
 167                                                                         address
 168                                                                         space
 169      **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
 170      -----------------------------------------------------------------------------------------------------------------------
 171      ``cedar``                   ``r600``     dGPU                    - Does not
 172                                                                         support
 173                                                                         generic
 174                                                                         address
 175                                                                         space
 176      ``cypress``                 ``r600``     dGPU                    - Does not
 177                                                                         support
 178                                                                         generic
 179                                                                         address
 180                                                                         space
 181      ``juniper``                 ``r600``     dGPU                    - Does not
 182                                                                         support
 183                                                                         generic
 184                                                                         address
 185                                                                         space
 186      ``redwood``                 ``r600``     dGPU                    - Does not
 187                                                                         support
 188                                                                         generic
 189                                                                         address
 190                                                                         space
 191      ``sumo``                    ``r600``     dGPU                    - Does not
 192                                                                         support
 193                                                                         generic
 194                                                                         address
 195                                                                         space
 196      **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
 197      -----------------------------------------------------------------------------------------------------------------------
 198      ``barts``                   ``r600``     dGPU                    - Does not
 199                                                                         support
 200                                                                         generic
 201                                                                         address
 202                                                                         space
 203      ``caicos``                  ``r600``     dGPU                    - Does not
 204                                                                         support
 205                                                                         generic
 206                                                                         address
 207                                                                         space
 208      ``cayman``                  ``r600``     dGPU                    - Does not
 209                                                                         support
 210                                                                         generic
 211                                                                         address
 212                                                                         space
 213      ``turks``                   ``r600``     dGPU                    - Does not
 214                                                                         support
 215                                                                         generic
 216                                                                         address
 217                                                                         space
 218      **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
 219      -----------------------------------------------------------------------------------------------------------------------
 220      ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 221                                                                         support
 222                                                                         generic
 223                                                                         address
 224                                                                         space
 225      ``gfx601``  - ``pitcairn``  ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 226                  - ``verde``                                            support
 227                                                                         generic
 228                                                                         address
 229                                                                         space
 230      ``gfx602``  - ``hainan``    ``amdgcn``   dGPU                    - Does not      - *pal-amdpal*
 231                  - ``oland``                                            support
 232                                                                         generic
 233                                                                         address
 234                                                                         space
 235      **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
 236      -----------------------------------------------------------------------------------------------------------------------
 237      ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - Offset        - *rocm-amdhsa* - A6-7000
 238                                                                         flat          - *pal-amdhsa*  - A6 Pro-7050B
 239                                                                         scratch       - *pal-amdpal*  - A8-7100
 240                                                                                                       - A8 Pro-7150B
 241                                                                                                       - A10-7300
 242                                                                                                       - A10 Pro-7350B
 243                                                                                                       - FX-7500
 244                                                                                                       - A8-7200P
 245                                                                                                       - A10-7400P
 246                                                                                                       - FX-7600P
 247      ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro W8100
 248                                                                         flat          - *pal-amdhsa*  - FirePro W9100
 249                                                                         scratch       - *pal-amdpal*  - FirePro S9150
 250                                                                                                       - FirePro S9170
 251      ``gfx702``                  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 290
 252                                                                         flat          - *pal-amdhsa*  - Radeon R9 290x
 253                                                                         scratch       - *pal-amdpal*  - Radeon R390
 254                                                                                                       - Radeon R390x
 255      ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  - E1-2100
 256                  - ``mullins``                                          flat          - *pal-amdpal*  - E1-2200
 257                                                                         scratch                       - E1-2500
 258                                                                                                       - E2-3000
 259                                                                                                       - E2-3800
 260                                                                                                       - A4-5000
 261                                                                                                       - A4-5100
 262                                                                                                       - A6-5200
 263                                                                                                       - A4 Pro-3340B
 264      ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Offset        - *pal-amdhsa*  - Radeon HD 7790
 265                                                                         flat          - *pal-amdpal*  - Radeon HD 8770
 266                                                                         scratch                       - R7 260
 267                                                                                                       - R7 260X
 268      ``gfx705``                  ``amdgcn``   APU                     - Offset        - *pal-amdhsa*  *TBA*
 269                                                                         flat          - *pal-amdpal*
 270                                                                         scratch                       .. TODO::
 271
 272                                                                                                         Add product
 273                                                                                                         names.
 274
 275      **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
 276      -----------------------------------------------------------------------------------------------------------------------
 277      ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* - A6-8500P
 278                                                                         flat          - *pal-amdhsa*  - Pro A6-8500B
 279                                                                         scratch       - *pal-amdpal*  - A8-8600P
 280                                                                                                       - Pro A8-8600B
 281                                                                                                       - FX-8800P
 282                                                                                                       - Pro A12-8800B
 283                                                                                                       - A10-8700P
 284                                                                                                       - Pro A10-8700B
 285                                                                                                       - A10-8780P
 286                                                                                                       - A10-9600P
 287                                                                                                       - A10-9630P
 288                                                                                                       - A12-9700P
 289                                                                                                       - A12-9730P
 290                                                                                                       - FX-9800P
 291                                                                                                       - FX-9830P
 292                                                                                                       - E2-9010
 293                                                                                                       - A6-9210
 294                                                                                                       - A9-9410
 295      ``gfx802``  - ``iceland``   ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon R9 285
 296                  - ``tonga``                                            flat          - *pal-amdhsa*  - Radeon R9 380
 297                                                                         scratch       - *pal-amdpal*  - Radeon R9 385
 298      ``gfx803``  - ``fiji``      ``amdgcn``   dGPU                                    - *rocm-amdhsa* - Radeon R9 Nano
 299                                                                                       - *pal-amdhsa*  - Radeon R9 Fury
 300                                                                                       - *pal-amdpal*  - Radeon R9 FuryX
 301                                                                                                       - Radeon Pro Duo
 302                                                                                                       - FirePro S9300x2
 303                                                                                                       - Radeon Instinct MI8
 304      \           - ``polaris10`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 470
 305                                                                         flat          - *pal-amdhsa*  - Radeon RX 480
 306                                                                         scratch       - *pal-amdpal*  - Radeon Instinct MI6
 307      \           - ``polaris11`` ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - Radeon RX 460
 308                                                                         flat          - *pal-amdhsa*
 309                                                                         scratch       - *pal-amdpal*
 310      ``gfx805``  - ``tongapro``  ``amdgcn``   dGPU                    - Offset        - *rocm-amdhsa* - FirePro S7150
 311                                                                         flat          - *pal-amdhsa*  - FirePro S7100
 312                                                                         scratch       - *pal-amdpal*  - FirePro W7100
 313                                                                                                       - Mobile FirePro
 314                                                                                                         M7170
 315      ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack           - Offset        - *rocm-amdhsa* *TBA*
 316                                                                         flat          - *pal-amdhsa*
 317                                                                         scratch       - *pal-amdpal*  .. TODO::
 318
 319                                                                                                         Add product
 320                                                                                                         names.
 321
 322      **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
 323      -----------------------------------------------------------------------------------------------------------------------
 324      ``gfx900``                  ``amdgcn``   dGPU  - xnack           - Absolute      - *rocm-amdhsa* - Radeon Vega
 325                                                                         flat          - *pal-amdhsa*    Frontier Edition
 326                                                                         scratch       - *pal-amdpal*  - Radeon RX Vega 56
 327                                                                                                       - Radeon RX Vega 64
 328                                                                                                       - Radeon RX Vega 64
 329                                                                                                         Liquid
 330                                                                                                       - Radeon Instinct MI25
 331      ``gfx902``                  ``amdgcn``   APU   - xnack           - Absolute      - *rocm-amdhsa* - Ryzen 3 2200G
 332                                                                         flat          - *pal-amdhsa*  - Ryzen 5 2400G
 333                                                                         scratch       - *pal-amdpal*
 334      ``gfx904``                  ``amdgcn``   dGPU  - xnack                           - *rocm-amdhsa* *TBA*
 335                                                                                       - *pal-amdhsa*
 336                                                                                       - *pal-amdpal*  .. TODO::
 337
 338                                                                                                         Add product
 339                                                                                                         names.
 340
 341      ``gfx906``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - Radeon Instinct MI50
 342                                                     - xnack             flat          - *pal-amdhsa*  - Radeon Instinct MI60
 343                                                                         scratch       - *pal-amdpal*  - Radeon VII
 344                                                                                                       - Radeon Pro VII
 345      ``gfx908``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
 346                                                     - xnack             flat
 347                                                                         scratch
 348
 349      ``gfx909``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  *TBA*
 350                                                                         flat
 351                                                                         scratch                       .. TODO::
 352
 353                                                                                                         Add product
 354                                                                                                         names.
 355
 356      ``gfx90a``                  ``amdgcn``   dGPU  - sramecc         - Absolute      - *rocm-amdhsa* *TBA*
 357                                                     - tgsplit           flat
 358                                                     - xnack             scratch                       .. TODO::
 359                                                                       - Packed
 360                                                                         work-item                       Add product
 361                                                                         IDs                             names.
 362
 363      ``gfx90c``                  ``amdgcn``   APU   - xnack           - Absolute      - *pal-amdpal*  - Ryzen 7 4700G
 364                                                                         flat                          - Ryzen 7 4700GE
 365                                                                         scratch                       - Ryzen 5 4600G
 366                                                                                                       - Ryzen 5 4600GE
 367                                                                                                       - Ryzen 3 4300G
 368                                                                                                       - Ryzen 3 4300GE
 369                                                                                                       - Ryzen Pro 4000G
 370                                                                                                       - Ryzen 7 Pro 4700G
 371                                                                                                       - Ryzen 7 Pro 4750GE
 372                                                                                                       - Ryzen 5 Pro 4650G
 373                                                                                                       - Ryzen 5 Pro 4650GE
 374                                                                                                       - Ryzen 3 Pro 4350G
 375                                                                                                       - Ryzen 3 Pro 4350GE
 376
 377      **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
 378      -----------------------------------------------------------------------------------------------------------------------
 379      ``gfx1010``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5700
 380                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5700 XT
 381                                                     - xnack             scratch       - *pal-amdpal*  - Radeon Pro 5600 XT
 382                                                                                                       - Radeon Pro 5600M
 383      ``gfx1011``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon Pro V520
 384                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 385                                                     - xnack             scratch       - *pal-amdpal*
 386
 387      ``gfx1012``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 5500
 388                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 5500 XT
 389                                                     - xnack             scratch       - *pal-amdpal*
 390      ``gfx1013``                 ``amdgcn``   APU   - cumode          - Absolute      - *rocm-amdhsa* *TBA*
 391                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 392                                                     - xnack             scratch       - *pal-amdpal*  .. TODO::
 393
 394                                                                                                         Add product
 395                                                                                                         names.
 396
 397      **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
 398      -----------------------------------------------------------------------------------------------------------------------
 399      ``gfx1030``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6800
 400                                                     - wavefrontsize64   flat          - *pal-amdhsa*  - Radeon RX 6800 XT
 401                                                                         scratch       - *pal-amdpal*  - Radeon RX 6900 XT
 402      ``gfx1031``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* - Radeon RX 6700 XT
 403                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 404                                                                         scratch       - *pal-amdpal*
 405      ``gfx1032``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *rocm-amdhsa* *TBA*
 406                                                     - wavefrontsize64   flat          - *pal-amdhsa*
 407                                                                         scratch       - *pal-amdpal*  .. TODO::
 408
 409                                                                                                         Add product
 410                                                                                                         names.
 411
 412      ``gfx1033``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 413                                                     - wavefrontsize64   flat
 414                                                                         scratch                       .. TODO::
 415
 416                                                                                                         Add product
 417                                                                                                         names.
 418      ``gfx1034``                 ``amdgcn``   dGPU  - cumode          - Absolute      - *pal-amdpal*  *TBA*
 419                                                     - wavefrontsize64   flat
 420                                                                         scratch                       .. TODO::
 421
 422                                                                                                         Add product
 423                                                                                                         names.
 424
 425      ``gfx1035``                 ``amdgcn``   APU   - cumode          - Absolute      - *pal-amdpal*  *TBA*
 426                                                     - wavefrontsize64   flat
 427                                                                         scratch                       .. TODO::
 428                                                                                                         Add product
 429                                                                                                         names.
 430
 431      =========== =============== ============ ===== ================= =============== =============== ======================
 432
 433 .. _amdgpu-target-features:
 434
 435 Target Features
 436 ---------------
 437
 438 Target features control how code is generated to support certain
 439 processor-specific features. Not all target features are supported by
 440 all processors. The runtime must ensure that the features supported by
 441 the device used to execute the code match the features enabled when
 442 generating the code. A mismatch of features may result in incorrect
 443 execution, or a reduction in performance.
 444
 445 The target features supported by each processor are listed in
 446 :ref:`amdgpu-processor-table`.
 447
 448 Target features are controlled by exactly one of the following Clang
 449 options:
 450
 451 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
 452
 453   The ``-mcpu`` and ``--offload-arch`` options can specify target features as
 454   optional components of the target ID. A target feature that is omitted has
 455   the ``any`` value. See :ref:`amdgpu-target-id`.
 456
 457 ``-m[no-]<target-feature>``
 458
 459   Target features not specified by the target ID are specified using a
 460   separate option. These target features can have an ``on`` or ``off``
 461   value.  ``on`` is specified by omitting the ``no-`` prefix, and
 462   ``off`` is specified by including the ``no-`` prefix. The default
 463   if not specified is ``off``.
 464
 465 For example:
 466
 467 ``-mcpu=gfx908:xnack+``
 468   Enable the ``xnack`` feature.
 469 ``-mcpu=gfx908:xnack-``
 470   Disable the ``xnack`` feature.
 471 ``-mcumode``
 472   Enable the ``cumode`` feature.
 473 ``-mno-cumode``
 474   Disable the ``cumode`` feature.
 475
 476   .. table:: AMDGPU Target Features
 477      :name: amdgpu-target-features-table
 478
 479      =============== ============================ ==================================================
 480      Target Feature  Clang Option to Control      Description
 481      Name
 482      =============== ============================ ==================================================
 483      cumode          - ``-m[no-]cumode``          Control the wavefront execution mode used
 484                                                   when generating code for kernels. When disabled,
 485                                                   native WGP wavefront execution mode is used;
 486                                                   when enabled, CU wavefront execution mode is used
 487                                                   (see :ref:`amdgpu-amdhsa-memory-model`).
 488
 489      sramecc         - ``-mcpu``                  If specified, generate code that can only be
 490                      - ``--offload-arch``         loaded and executed in a process that has a
 491                                                   matching setting for SRAMECC.
 492
 493                                                   If not specified for code object V2 to V3, generate
 494                                                   code that can be loaded and executed in a process
 495                                                   with SRAMECC enabled.
 496
 497                                                   If not specified for code object V4, generate
 498                                                   code that can be loaded and executed in a process
 499                                                   with either setting of SRAMECC.
 500
 501      tgsplit         - ``-m[no-]tgsplit``         Enable/disable generating code that assumes
 502                                                   work-groups are launched in threadgroup split mode.
 503                                                   When enabled, the waves of a work-group may be
 504                                                   launched in different CUs.
 505
 506      wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
 507                                                   generating code for kernels. When disabled,
 508                                                   native wavefront size 32 is used; when enabled,
 509                                                   wavefront size 64 is used.
 510
 511      xnack           - ``-mcpu``                  If specified, generate code that can only be
 512                      - ``--offload-arch``         loaded and executed in a process that has a
 513                                                   matching setting for XNACK replay.
 514
 515                                                   If not specified for code object V2 to V3, generate
 516                                                   code that can be loaded and executed in a process
 517                                                   with XNACK replay enabled.
 518
 519                                                   If not specified for code object V4, generate
 520                                                   code that can be loaded and executed in a process
 521                                                   with either setting of XNACK replay.
 522
 523                                                   XNACK replay can be used for demand paging and
 524                                                   page migration. If enabled in the device, then if
 525                                                   a page fault occurs the code may execute
 526                                                   incorrectly unless generated with XNACK replay
 527                                                   enabled, or generated for code object V4 without
 528                                                   specifying XNACK replay. Executing code that was
 529                                                   generated with XNACK replay enabled, or generated
 530                                                   for code object V4 without specifying XNACK replay,
 531                                                   on a device that does not have XNACK replay
 532                                                   enabled will execute correctly but may be less
 533                                                   performant than code generated for XNACK replay
 534                                                   disabled.
 535      =============== ============================ ==================================================
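
     The split between the two kinds of option can be illustrated with a small
     Python sketch. The helper ``clang_feature_options`` and its ``settings``
     argument are hypothetical and are not part of Clang. The sketch only
     assembles option strings from the feature names in the table above, and it
     does not check that the chosen processor supports a feature.

     .. code:: python

       # Features the table above lists as controlled through the target ID.
       TARGET_ID_FEATURES = ("sramecc", "xnack")
       # Features the table above lists as controlled by -m[no-]<target-feature>.
       SEPARATE_OPTION_FEATURES = ("cumode", "tgsplit", "wavefrontsize64")


       def clang_feature_options(processor, settings):
           """Return Clang options for settings (feature name -> True/False).

           Hypothetical helper: a feature that is absent from settings is left
           at its default (any for target ID features, off for the others).
           """
           target_id = processor
           options = []
           for name in sorted(settings):
               enabled = settings[name]
               if name in TARGET_ID_FEATURES:
                   # Target ID features become ":<feature>+" or ":<feature>-".
                   target_id += ":" + name + ("+" if enabled else "-")
               elif name in SEPARATE_OPTION_FEATURES:
                   # Other features use -m<feature> or -mno-<feature>.
                   options.append("-m" + ("" if enabled else "no-") + name)
               else:
                   raise ValueError("unknown target feature: " + name)
           return ["-mcpu=" + target_id] + options

     For ``gfx1010`` with ``xnack`` on, ``cumode`` off and ``wavefrontsize64``
     on, the sketch yields ``-mcpu=gfx1010:xnack+``, ``-mno-cumode`` and
     ``-mwavefrontsize64``.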
 536
 537 .. _amdgpu-target-id:
 538
 539 Target ID
 540 ---------
 541
 542 AMDGPU supports target IDs. See `Clang Offload Bundler
 543 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
 544 description. The AMDGPU target specific information is:
 545
 546 **processor**
 547   Is an AMDGPU processor or alternative processor name specified in
 548   :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
 549   the primary processor and alternative processor names. The canonical form
 550   target ID only allows the primary processor name.
 551
 552 **target-feature**
 553   Is a target feature name specified in :ref:`amdgpu-target-features-table` that
 554   is supported by the processor. The target features supported by each processor
 555   are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
 556   a target ID are marked as being controlled by ``-mcpu`` and
 557   ``--offload-arch``. Each target feature must appear at most once in a target
 558   ID. The non-canonical form target ID allows the target features to be
 559   specified in any order. The canonical form target ID requires the target
 560   features to be specified in alphabetic order.
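
     As an illustration of the ordering rule, the following Python sketch
     rewrites a non-canonical target ID with its target features in alphabetic
     order. It assumes the ``<processor>(:<target-feature>(+|-))*`` syntax from
     the Clang Offload Bundler description. ``canonical_target_id`` is a
     hypothetical helper: it does not translate alternative processor names and
     does not check the features against :ref:`amdgpu-processor-table`.

     .. code:: python

       def canonical_target_id(target_id):
           """Return target_id with its target features in alphabetic order.

           Assumes the processor is already the primary processor name.
           """
           processor, *features = target_id.split(":")
           seen = set()
           for feature in features:
               # Every feature must be a name followed by "+" or "-".
               if len(feature) < 2 or feature[-1] not in "+-":
                   raise ValueError("malformed target feature: " + repr(feature))
               # Each target feature may appear at most once.
               if feature[:-1] in seen:
                   raise ValueError("target feature repeated: " + feature[:-1])
               seen.add(feature[:-1])
           ordered = sorted(features, key=lambda feature: feature[:-1])
           return ":".join([processor] + ordered)

     With this sketch, ``gfx908:xnack+:sramecc-`` becomes
     ``gfx908:sramecc-:xnack+``, and a target ID without features is returned
     unchanged.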
 561
 562 .. _amdgpu-target-id-v2-v3:
 563
 564 Code Object V2 to V3 Target ID
 565 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 566
 567 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
 568 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
 569 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
 570 directive and the bundle entry ID. In those cases it has the following BNF
 571 syntax:
 572
 573 .. code::
 574
 575   <target-id> ::= <processor> ( "+" <target-feature> )*
 576
 577 Where a target feature is omitted if *Off* and present if *On* or *Any*.
 578
 579 .. note::
 580
 581   Code object V2 to V3 cannot represent *Any* and treats it the same as
 582   *On*.
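
     The rule above can be sketched in Python as follows. The function
     ``v2_v3_target_id`` and its parameters are hypothetical: the list of target
     ID features a processor supports would normally come from
     :ref:`amdgpu-processor-table`, and emitting the features in alphabetic
     order is an assumption of this sketch rather than something stated here.

     .. code:: python

       def v2_v3_target_id(processor, supported_features, settings):
           """Build the V2 to V3 form <processor>(+<target-feature>)*.

           settings maps a feature name to "on", "off" or "any"; a feature
           missing from settings is taken to be "any".
           """
           target_id = processor
           for name in sorted(supported_features):
               # "off" features are omitted; "on" and "any" are both emitted.
               if settings.get(name, "any") != "off":
                   target_id += "+" + name
           return target_id

     For a processor with ``sramecc`` and ``xnack`` target ID features, setting
     ``xnack`` to *On* and ``sramecc`` to *Off* gives ``<processor>+xnack``,
     while leaving both unspecified gives ``<processor>+sramecc+xnack``.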
 583
 584 .. _amdgpu-embedding-bundled-objects:
 585
 586 Embedding Bundled Code Objects
 587 ------------------------------
 588
 589 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
 590 as described in `Clang Offload Bundler
 591 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
 592
 593 .. note::
 594
 595   The target ID syntax used for code object V2 to V3 for a bundle entry ID
 596   differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
 597
 598 .. _amdgpu-address-spaces:
 599
 600 Address Spaces
 601 --------------
 602
 603 The AMDGPU architecture supports a number of memory address spaces. The address
 604 space names use the OpenCL standard names, with some additions.
 605
 606 The AMDGPU address spaces correspond to target architecture specific LLVM
 607 address space numbers used in LLVM IR.
 608
 609 The AMDGPU address spaces are described in
 610 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
 611 supported for the ``amdgcn`` target.
 612
 613   .. table:: AMDGPU Address Spaces
 614      :name: amdgpu-address-spaces-table
 615
 616      ================================= =============== =========== ================ ======= ============================
 617      ..                                                                                     64-Bit Process Address Space
 618      --------------------------------- --------------- ----------- ---------------- ------------------------------------
 619      Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
 620                                        Space Number    Name        Name             Size
 621      ================================= =============== =========== ================ ======= ============================
 622      Generic                           0               flat        flat             64      0x0000000000000000
 623      Global                            1               global      global           64      0x0000000000000000
 624      Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
 625      Local                             3               group       LDS              32      0xFFFFFFFF
 626      Constant                          4               constant    *same as global* 64      0x0000000000000000
 627      Private                           5               private     scratch          32      0xFFFFFFFF
 628      Constant 32-bit                   6               *TODO*                               0x00000000
 629      Buffer Fat Pointer (experimental) 7               *TODO*
 630      ================================= =============== =========== ================ ======= ============================
 631
 632 **Generic**
 633   The generic address space is supported unless the *Target Properties* column
 634   of :ref:`amdgpu-processor-table` specifies *Does not support generic address
 635   space*.
 636
 637   The generic address space uses the hardware flat address support for two fixed
 638   ranges of virtual addresses (the private and local apertures), which lie
 639   outside the range of addressable global memory, to map from a flat address to
 640   a private or local address. This uses FLAT instructions that can take a flat
 641   address and access global, private (scratch), and group (LDS) memory depending
 642   on whether the address is within one of the aperture ranges.
 643
 644   Flat access to scratch requires hardware aperture setup and setup in the
 645   kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
 646   access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
 647   setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
 648
 649   To convert between a private or group address space address (termed a segment
 650   address) and a flat address, the base address of the corresponding aperture
 651   can be used. For GFX7-GFX8 these are available in the
 652   :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
 653   the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
 654   GFX9-GFX10 the aperture base addresses are directly available as inline
 655   constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
 656   In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
 657   aligned to 2^32, which makes it easier to convert from flat to segment or
 658   segment to flat.
 659
 660   A global address space address has the same value when used as a flat address
 661   so no conversion is needed.
 662
 663 **Global and Constant**
 664   The global and constant address spaces both use global virtual addresses,
 665   which are the same virtual address space used by the CPU. However, some
 666   virtual addresses may only be accessible to the CPU, some only accessible
 667   by the GPU, and some by both.
 668
 669   Using the constant address space indicates that the data will not change
 670   during the execution of the kernel. This allows scalar read instructions to
 671   be used. Because the constant address space can only be modified on the host
 672   side, a generic pointer loaded from the constant address space can safely be
 673   assumed to be a global pointer, since only device global memory is visible
 674   and managed on the host side. Volatile data in the vector and scalar L1 caches
 675   is invalidated before each kernel dispatch executes, which allows constant
 676   memory to change values between kernel dispatches.
 677
 678 **Region**
 679   The region address space uses the hardware Global Data Store (GDS). All
 680   wavefronts executing on the same device will access the same memory for any
 681   given region address. However, the same region address accessed by wavefronts
 682   executing on different devices will access different memory. It offers higher
 683   performance than global memory and is allocated by the runtime. The data
 684   store (DS) instructions can be used to access it.
 685
 686 **Local**
 687   The local address space uses the hardware Local Data Store (LDS) which is
 688   automatically allocated when the hardware creates the wavefronts of a
 689   work-group, and freed when all the wavefronts of a work-group have
 690   terminated. All wavefronts belonging to the same work-group will access the
 691   same memory for any given local address. However, the same local address
 692   accessed by wavefronts belonging to different work-groups will access
 693   different memory. It offers higher performance than global memory. The data
 694   store (DS) instructions can be used to access it.
 695
 696 **Private**
 697   The private address space uses the hardware scratch memory support which
 698   automatically allocates memory when it creates a wavefront and frees it when
 699   a wavefront terminates. The memory accessed by a lane of a wavefront for any
 700   given private address will be different to the memory accessed by another lane
 701   of the same or different wavefront for the same private address.
 702
 703   If a kernel dispatch uses scratch, then the hardware allocates memory from a
 704   pool of backing memory allocated by the runtime for each wavefront. The lanes
 705   of the wavefront access this using dword (4 byte) interleaving. The mapping
 706   used from private address to backing memory address is:
 707
 708     ``wavefront-scratch-base +
 709     ((private-address / 4) * wavefront-size * 4) +
 710     (wavefront-lane-id * 4) + (private-address % 4)``
 711
 712   If each lane of a wavefront accesses the same private address, the
 713   interleaving results in adjacent dwords being accessed and hence requires
 714   fewer cache lines to be fetched.
 715
 716   There are different ways that the wavefront scratch base address is
 717   determined by a wavefront (see
 718   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 719
 720   Scratch memory can be accessed in an interleaved manner using buffer
 721   instructions with the scratch buffer descriptor and per wavefront scratch
 722   offset, by the scratch instructions, or by flat instructions. Multi-dword
 723   access is not supported except by flat and scratch instructions in
 724   GFX9-GFX10.
 725
 726 **Constant 32-bit**
 727   *TODO*
 728
 729 **Buffer Fat Pointer**
 730   The buffer fat pointer is an experimental address space that is currently
 731   unsupported in the backend. It exposes a non-integral pointer that is intended,
 732   in the future, to support the modelling of 128-bit buffer descriptors
 733   plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
 734   *pointer*), allowing normal LLVM load/store/atomic operations to be used to
 735   model the buffer descriptors used heavily in graphics workloads targeting
 736   the backend.
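
     The address arithmetic described above for the **Generic** and **Private**
     address spaces can be sketched in Python. This is only an illustration under
     stated assumptions: the function names are hypothetical, the aperture base
     is passed in by the caller rather than read from hardware, and the different
     NULL values of the address spaces are not handled.

     .. code:: python

       APERTURE_SIZE = 1 << 32  # Aperture size in 64-bit address mode (2^32 bytes).


       def segment_to_flat(aperture_base, segment_address):
           """Map a 32-bit private or local address into its flat aperture."""
           assert aperture_base % APERTURE_SIZE == 0, "base must be 2^32 aligned"
           assert 0 <= segment_address < APERTURE_SIZE
           # The base is 2^32 aligned, so OR-ing in the low 32 bits is enough.
           return aperture_base | segment_address


       def flat_to_segment(flat_address):
           """Recover the 32-bit segment address from a flat aperture address."""
           return flat_address & (APERTURE_SIZE - 1)


       def is_in_aperture(aperture_base, flat_address):
           """Return True if the flat address lies inside the given aperture."""
           return aperture_base <= flat_address < aperture_base + APERTURE_SIZE


       def scratch_backing_address(wavefront_scratch_base, private_address,
                                   wavefront_lane_id, wavefront_size):
           """Apply the dword-interleaved private address mapping given above."""
           return (wavefront_scratch_base
                   + (private_address // 4) * wavefront_size * 4
                   + wavefront_lane_id * 4
                   + private_address % 4)

     With a wavefront size of 64 and a scratch base of 0, lanes 0 and 1 accessing
     private address 0 map to backing offsets 0 and 4, which are adjacent dwords.
     Private address 4 in lane 0 maps to offset 256, the start of the next
     interleaved row.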
 737
 738 .. _amdgpu-memory-scopes:
 739
 740 Memory Scopes
 741 -------------
 742
 743 This section describes the LLVM memory synchronization scopes supported by the
 744 AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
 745 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 746
 747 The memory model supported is based on the HSA memory model [HSA]_ which is
 748 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 749 relation is transitive over the synchronizes-with relation independent of scope
 750 and synchronizes-with allows the memory scope instances to be inclusive (see
 751 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
 752
 753 This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
 754 inclusion and requires the memory scopes to match exactly. However, this
 755 is conservatively correct for OpenCL.
 756
 757   .. table:: AMDHSA LLVM Sync Scopes
 758      :name: amdgpu-amdhsa-llvm-sync-scopes-table
 759
 760      ======================= ===================================================
 761      LLVM Sync Scope         Description
 762      ======================= ===================================================
 763      *none*                  The default: ``system``.
 764
 765                              Synchronizes with, and participates in modification
 766                              and seq_cst total orderings with, other operations
 767                              (except image operations) for all address spaces
 768                              (except private, or generic that accesses private)
 769                              provided the other operation's sync scope is:
 770
 771                              - ``system``.
 772                              - ``agent`` and executed by a thread on the same
 773                                agent.
 774                              - ``workgroup`` and executed by a thread in the
 775                                same work-group.
 776                              - ``wavefront`` and executed by a thread in the
 777                                same wavefront.
 778
 779      ``agent``               Synchronizes with, and participates in modification
 780                              and seq_cst total orderings with, other operations
 781                              (except image operations) for all address spaces
 782                              (except private, or generic that accesses private)
 783                              provided the other operation's sync scope is:
 784
 785                              - ``system`` or ``agent`` and executed by a thread
 786                                on the same agent.
 787                              - ``workgroup`` and executed by a thread in the
 788                                same work-group.
 789                              - ``wavefront`` and executed by a thread in the
 790                                same wavefront.
 791
 792      ``workgroup``           Synchronizes with, and participates in modification
 793                              and seq_cst total orderings with, other operations
 794                              (except image operations) for all address spaces
 795                              (except private, or generic that accesses private)
 796                              provided the other operation's sync scope is:
 797
 798                              - ``system``, ``agent`` or ``workgroup`` and
 799                                executed by a thread in the same work-group.
 800                              - ``wavefront`` and executed by a thread in the
 801                                same wavefront.
 802
 803      ``wavefront``           Synchronizes with, and participates in modification
 804                              and seq_cst total orderings with, other operations
 805                              (except image operations) for all address spaces
 806                              (except private, or generic that accesses private)
 807                              provided the other operation's sync scope is:
 808
 809                              - ``system``, ``agent``, ``workgroup`` or
 810                                ``wavefront`` and executed by a thread in the
 811                                same wavefront.
 812
 813      ``singlethread``        Only synchronizes with, and participates in
 814                              modification and seq_cst total orderings with,
 815                              other operations (except image operations) running
 816                              in the same thread for all address spaces (for
 817                              example, in signal handlers).
 818
 819      ``one-as``              Same as ``system`` but only synchronizes with other
 820                              operations within the same address space.
 821
 822      ``agent-one-as``        Same as ``agent`` but only synchronizes with other
 823                              operations within the same address space.
 824
 825      ``workgroup-one-as``    Same as ``workgroup`` but only synchronizes with
 826                              other operations within the same address space.
 827
 828      ``wavefront-one-as``    Same as ``wavefront`` but only synchronizes with
 829                              other operations within the same address space.
 830
 831      ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
 832                              other operations within the same address space.
 833      ======================= ===================================================
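
     The scope inclusion rule in the table above can be summarized as: two
     operations can synchronize when both threads belong to the same instance of
     the narrower of the two sync scopes. The Python sketch below models only
     that rule. The function name and the ``(agent, work-group, wavefront,
     lane)`` position tuples are hypothetical, and the sketch ignores the
     address space restrictions, image operations and the ``*-one-as`` variants.

     .. code:: python

       # Sync scopes from widest to narrowest. The value is the number of
       # leading position components two threads must share to be in the same
       # instance of that scope.
       SCOPE_DEPTH = {"system": 0, "agent": 1, "workgroup": 2, "wavefront": 3,
                      "singlethread": 4}


       def may_synchronize(scope_a, position_a, scope_b, position_b):
           """Return True if the two sync scopes include both threads.

           An empty scope name stands for the default, system. Each position
           is an (agent, work-group, wavefront, lane) tuple.
           """
           depth = max(SCOPE_DEPTH[scope_a or "system"],
                       SCOPE_DEPTH[scope_b or "system"])
           # The narrower scope decides how much of the position must match.
           return position_a[:depth] == position_b[:depth]

     For example, an ``agent`` operation and a ``workgroup`` operation are only
     inclusive when both threads are in the same work-group, which matches the
     ``agent`` and ``workgroup`` rows of the table.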
 834
 835 LLVM IR Intrinsics
 836 ------------------
 837
 838 The AMDGPU backend implements the following LLVM IR intrinsics.
 839
 840 *This section is WIP.*
 841
 842 .. TODO::
 843
 844    List AMDGPU intrinsics.
 845
 846 LLVM IR Attributes
 847 ------------------
 848
 849 The AMDGPU backend supports the following LLVM IR attributes.
 850
 851   .. table:: AMDGPU LLVM IR Attributes
 852      :name: amdgpu-llvm-ir-attributes-table
 853
 854      ======================================= ==========================================================
 855      LLVM Attribute                          Description
 856      ======================================= ==========================================================
 857      "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
 858                                              will be specified when the kernel is dispatched. Generated
 859                                              by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
 860                                              The implied default value is 1,1024.
 861
 862      "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
 863                                              argument block size for the implicit arguments. This
 864                                              varies by OS and language (for OpenCL see
 865                                              :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
 866      "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
 867                                              the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
 868      "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
 869                                              ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
 870      "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
 871                                              execution unit. Generated by the ``amdgpu_waves_per_eu``
 872                                              CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
 873                                              and the backend may not be able to satisfy the request. If
 874                                              the specified range is incompatible with the function's
 875                                              "amdgpu-flat-work-group-size" value, the occupancy bounds
 876                                              implied by the work-group size take precedence.
 877
 878      "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
 879                                              mode register to be set on entry. Overrides the default for
 880                                              the calling convention.
 881      "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
 882                                              the mode register to be set on entry. Overrides the default
 883                                              for the calling convention.
 884
 885      "amdgpu-no-workitem-id-x"               Indicates the function does not depend on the value of the
 886                                              llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
 887                                              attribute, or reached through a call site marked with this attribute,
 888                                              the value returned by the intrinsic is undefined. The backend can
 889                                              generally infer this during code generation, so typically there is no
 890                                              benefit to frontends marking functions with this.
 891
 892      "amdgpu-no-workitem-id-y"               The same as amdgpu-no-workitem-id-x, except for the
 893                                              llvm.amdgcn.workitem.id.y intrinsic.
 894
 895      "amdgpu-no-workitem-id-z"               The same as amdgpu-no-workitem-id-x, except for the
 896                                              llvm.amdgcn.workitem.id.z intrinsic.
 897
 898      "amdgpu-no-workgroup-id-x"              The same as amdgpu-no-workitem-id-x, except for the
 899                                              llvm.amdgcn.workgroup.id.x intrinsic.
 900
 901      "amdgpu-no-workgroup-id-y"              The same as amdgpu-no-workitem-id-x, except for the
 902                                              llvm.amdgcn.workgroup.id.y intrinsic.
 903
 904      "amdgpu-no-workgroup-id-z"              The same as amdgpu-no-workitem-id-x, except for the
 905                                              llvm.amdgcn.workgroup.id.z intrinsic.
 906
 907      "amdgpu-no-dispatch-ptr"                The same as amdgpu-no-workitem-id-x, except for the
 908                                              llvm.amdgcn.dispatch.ptr intrinsic.
 909
 910      "amdgpu-no-implicitarg-ptr"             The same as amdgpu-no-workitem-id-x, except for the
 911                                              llvm.amdgcn.implicitarg.ptr intrinsic.
 912
 913      "amdgpu-no-dispatch-id"                 The same as amdgpu-no-workitem-id-x, except for the
 914                                              llvm.amdgcn.dispatch.id intrinsic.
 915
 916      "amdgpu-no-queue-ptr"                   Similar to amdgpu-no-workitem-id-x, except for the
 917                                              llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
 918                                              attributes, the queue pointer may be required in situations where the
 919                                              intrinsic call does not directly appear in the program. Some subtargets
 920                                              require the queue pointer to handle some addrspacecasts, as well
 921                                              as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
 922                                              llvm.debugtrap intrinsics.
 923
 924      ======================================= ==========================================================
 925
 926 .. _amdgpu-elf-code-object:
 927
 928 ELF Code Object
 929 ===============
 930
 931 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 932 can be linked by ``lld`` to produce a standard ELF shared code object which can
 933 be loaded and executed on an AMDGPU target.
 934
 935 .. _amdgpu-elf-header:
 936
 937 Header
 938 ------
 939
 940 The AMDGPU backend uses the following ELF header:
 941
 942   .. table:: AMDGPU ELF Header
 943      :name: amdgpu-elf-header-table
 944
 945      ========================== ===============================
 946      Field                      Value
 947      ========================== ===============================
 948      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 949      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 950      ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
 951                                 - ``ELFOSABI_AMDGPU_HSA``
 952                                 - ``ELFOSABI_AMDGPU_PAL``
 953                                 - ``ELFOSABI_AMDGPU_MESA3D``
 954      ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
 955                                 - ``ELFABIVERSION_AMDGPU_HSA_V3``
 956                                 - ``ELFABIVERSION_AMDGPU_HSA_V4``
 957                                 - ``ELFABIVERSION_AMDGPU_PAL``
 958                                 - ``ELFABIVERSION_AMDGPU_MESA3D``
 959      ``e_type``                 - ``ET_REL``
 960                                 - ``ET_DYN``
 961      ``e_machine``              ``EM_AMDGPU``
 962      ``e_entry``                0
 963      ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-v2-table`,
 964                                 :ref:`amdgpu-elf-header-e_flags-table-v3`,
 965                                 and :ref:`amdgpu-elf-header-e_flags-table-v4`
 966      ========================== ===============================
 967
 968 ..
 969
 970   .. table:: AMDGPU ELF Header Enumeration Values
 971      :name: amdgpu-elf-header-enumeration-values-table
 972
 973      =============================== =====
 974      Name                            Value
 975      =============================== =====
 976      ``EM_AMDGPU``                   224
 977      ``ELFOSABI_NONE``               0
 978      ``ELFOSABI_AMDGPU_HSA``         64
 979      ``ELFOSABI_AMDGPU_PAL``         65
 980      ``ELFOSABI_AMDGPU_MESA3D``      66
 981      ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
 982      ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
 983      ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
 984      ``ELFABIVERSION_AMDGPU_PAL``    0
 985      ``ELFABIVERSION_AMDGPU_MESA3D`` 0
 986      =============================== =====
 987
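As an illustration of the two tables above, the following minimal sketch checks
whether an ELF identification array and ``e_machine`` value are consistent with
an AMDGPU code object. The function name ``amdgpu_is_code_object`` is
hypothetical and is not part of LLVM. The ``EI_*``, ``ELFCLASS*`` and
``ELFDATA2LSB`` constants are the standard values from the ELF specification;
the remaining values are copied from the enumeration table.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Standard ELF e_ident indices and values (ELF specification). */
  enum { EI_CLASS = 4, EI_DATA = 5, EI_OSABI = 7, EI_ABIVERSION = 8 };
  enum { ELFCLASS32 = 1, ELFCLASS64 = 2, ELFDATA2LSB = 1 };

  /* AMDGPU specific values from the enumeration table. */
  enum {
    EM_AMDGPU = 224,
    ELFOSABI_NONE = 0,
    ELFOSABI_AMDGPU_HSA = 64,
    ELFOSABI_AMDGPU_PAL = 65,
    ELFOSABI_AMDGPU_MESA3D = 66
  };

  /* Hypothetical helper: returns true if the header fields match the values
     an AMDGPU code object is documented to use. */
  bool amdgpu_is_code_object(const uint8_t e_ident[16], uint16_t e_machine) {
    /* ELF magic number: 0x7f 'E' 'L' 'F'. */
    if (e_ident[0] != 0x7f || e_ident[1] != 'E' || e_ident[2] != 'L' ||
        e_ident[3] != 'F')
      return false;
    /* r600 uses ELFCLASS32 and amdgcn uses ELFCLASS64. */
    if (e_ident[EI_CLASS] != ELFCLASS32 && e_ident[EI_CLASS] != ELFCLASS64)
      return false;
    /* All AMDGPU targets are little-endian. */
    if (e_ident[EI_DATA] != ELFDATA2LSB)
      return false;
    switch (e_ident[EI_OSABI]) {
    case ELFOSABI_NONE:
    case ELFOSABI_AMDGPU_HSA:
    case ELFOSABI_AMDGPU_PAL:
    case ELFOSABI_AMDGPU_MESA3D:
      break;
    default:
      return false;
    }
    return e_machine == EM_AMDGPU;
  }

A tool would typically call such a check on the first 16 bytes of the file and
the ``e_machine`` field before interpreting ``e_flags`` or the note records.
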
 988 ``e_ident[EI_CLASS]``
 989   The ELF class is:
 990
 991   * ``ELFCLASS32`` for ``r600`` architecture.
 992
 993   * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
 994     process address space applications.
 995
 996 ``e_ident[EI_DATA]``
 997   All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
 998
 999 ``e_ident[EI_OSABI]``
1000   One of the following AMDGPU target architecture specific OS ABIs
1001   (see :ref:`amdgpu-os`):
1002
1003   * ``ELFOSABI_NONE`` for *unknown* OS.
1004
1005   * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1006
1007   * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1008
1009   * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1010
1011 ``e_ident[EI_ABIVERSION]``
1012   The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1013   object conforms:
1014
1015   * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1016     runtime ABI for code object V2. Specify using the Clang option
1017     ``-mcode-object-version=2``.
1018
1019   * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1020     runtime ABI for code object V3. Specify using the Clang option
1021     ``-mcode-object-version=3``. This is the default code object
1022     version if not specified.
1023
1024   * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1025     runtime ABI for code object V4. Specify using the Clang option
1026     ``-mcode-object-version=4``.
1027
1028   * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1029     runtime ABI.
1030
1031   * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1032     3D runtime ABI.
1033
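For the ``amdhsa`` OS, the ABI version values listed above correspond
one-to-one to code object versions. A minimal sketch of that mapping follows;
the helper name ``amdgpu_hsa_code_object_version`` is hypothetical, and a
return value of 0 is an assumption used here to mean "not an ``amdhsa`` code
object or unknown ABI version".

.. code:: c

  #include <stdint.h>

  enum { ELFOSABI_AMDGPU_HSA = 64 };
  enum {
    ELFABIVERSION_AMDGPU_HSA_V2 = 0,
    ELFABIVERSION_AMDGPU_HSA_V3 = 1,
    ELFABIVERSION_AMDGPU_HSA_V4 = 2
  };

  /* Hypothetical helper: maps e_ident[EI_OSABI] and e_ident[EI_ABIVERSION]
     to a code object version (2, 3 or 4), or 0 if it cannot be determined. */
  int amdgpu_hsa_code_object_version(uint8_t osabi, uint8_t abiversion) {
    /* The PAL and MESA3D ABI versions share the value 0 with HSA V2, so the
       OS ABI must be checked first. */
    if (osabi != ELFOSABI_AMDGPU_HSA)
      return 0;
    switch (abiversion) {
    case ELFABIVERSION_AMDGPU_HSA_V2:
      return 2;
    case ELFABIVERSION_AMDGPU_HSA_V3:
      return 3;
    case ELFABIVERSION_AMDGPU_HSA_V4:
      return 4;
    default:
      return 0;
    }
  }
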
1034 ``e_type``
1035   Can be one of the following values:
1036
1037
1038   ``ET_REL``
1039     The type produced by the AMDGPU backend compiler as it is a relocatable code
1040     object.
1041
1042   ``ET_DYN``
1043     The type produced by the linker as it is a shared code object.
1044
1045   The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1046
1047 ``e_machine``
1048   The value ``EM_AMDGPU`` is used for the machine for all processors supported
1049   by the ``r600`` and ``amdgcn`` architectures (see
1050   :ref:`amdgpu-processor-table`). The specific processor is specified in the
1051   ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1052   :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1053   ``e_flags`` for code object V3 to V4 (see
1054   :ref:`amdgpu-elf-header-e_flags-table-v3` and
1055   :ref:`amdgpu-elf-header-e_flags-table-v4`).
1056
1057 ``e_entry``
1058   The entry point is 0 as the entry points for individual kernels must be
1059   selected in order to invoke them through AQL packets.
1060
1061 ``e_flags``
1062   The AMDGPU backend uses the following ELF header flags:
1063
1064   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1065      :name: amdgpu-elf-header-e_flags-v2-table
1066
1067      ===================================== ===== =============================
1068      Name                                  Value Description
1069      ===================================== ===== =============================
1070      ``EF_AMDGPU_FEATURE_XNACK_V2``        0x01  Indicates if the ``xnack``
1071                                                  target feature is
1072                                                  enabled for all code
1073                                                  contained in the code object.
1074                                                  If the processor
1075                                                  does not support the
1076                                                  ``xnack`` target
1077                                                  feature then must
1078                                                  be 0.
1079                                                  See
1080                                                  :ref:`amdgpu-target-features`.
1081      ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02  Indicates if the trap
1082                                                  handler is enabled for all
1083                                                  code contained in the code
1084                                                  object. If the processor
1085                                                  does not support a trap
1086                                                  handler then must be 0.
1087                                                  See
1088                                                  :ref:`amdgpu-target-features`.
1089      ===================================== ===== =============================
1090
1091   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1092      :name: amdgpu-elf-header-e_flags-table-v3
1093
1094      ================================= ===== =============================
1095      Name                              Value Description
1096      ================================= ===== =============================
1097      ``EF_AMDGPU_MACH``                0x0ff AMDGPU processor selection
1098                                              mask for
1099                                              ``EF_AMDGPU_MACH_xxx`` values
1100                                              defined in
1101                                              :ref:`amdgpu-ef-amdgpu-mach-table`.
1102      ``EF_AMDGPU_FEATURE_XNACK_V3``    0x100 Indicates if the ``xnack``
1103                                              target feature is
1104                                              enabled for all code
1105                                              contained in the code object.
1106                                              If the processor
1107                                              does not support the
1108                                              ``xnack`` target
1109                                              feature then must
1110                                              be 0.
1111                                              See
1112                                              :ref:`amdgpu-target-features`.
1113      ``EF_AMDGPU_FEATURE_SRAMECC_V3``  0x200 Indicates if the ``sramecc``
1114                                              target feature is
1115                                              enabled for all code
1116                                              contained in the code object.
1117                                              If the processor
1118                                              does not support the
1119                                              ``sramecc`` target
1120                                              feature then must
1121                                              be 0.
1122                                              See
1123                                              :ref:`amdgpu-target-features`.
1124      ================================= ===== =============================
1125
1126   .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
1127      :name: amdgpu-elf-header-e_flags-table-v4
1128
1129      ============================================ ===== ===================================
1130      Name                                         Value Description
1131      ============================================ ===== ===================================
1132      ``EF_AMDGPU_MACH``                           0x0ff AMDGPU processor selection
1133                                                         mask for
1134                                                         ``EF_AMDGPU_MACH_xxx`` values
1135                                                         defined in
1136                                                         :ref:`amdgpu-ef-amdgpu-mach-table`.
1137      ``EF_AMDGPU_FEATURE_XNACK_V4``               0x300 XNACK selection mask for
1138                                                         ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1139                                                         values.
1140      ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4``   0x000 XNACK unsupported.
1141      ``EF_AMDGPU_FEATURE_XNACK_ANY_V4``           0x100 XNACK can have any value.
1142      ``EF_AMDGPU_FEATURE_XNACK_OFF_V4``           0x200 XNACK disabled.
1143      ``EF_AMDGPU_FEATURE_XNACK_ON_V4``            0x300 XNACK enabled.
1144      ``EF_AMDGPU_FEATURE_SRAMECC_V4``             0xc00 SRAMECC selection mask for
1145                                                         ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1146                                                         values.
1147      ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1148      ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4``         0x400 SRAMECC can have any value.
1149      ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1150      ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4``          0xc00 SRAMECC enabled.
1151      ============================================ ===== ===================================
1152
1153   .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1154      :name: amdgpu-ef-amdgpu-mach-table
1155
1156      ==================================== ========== =============================
1157      Name                                 Value      Description (see
1158                                                      :ref:`amdgpu-processor-table`)
1159      ==================================== ========== =============================
1160      ``EF_AMDGPU_MACH_NONE``              0x000      *not specified*
1161      ``EF_AMDGPU_MACH_R600_R600``         0x001      ``r600``
1162      ``EF_AMDGPU_MACH_R600_R630``         0x002      ``r630``
1163      ``EF_AMDGPU_MACH_R600_RS880``        0x003      ``rs880``
1164      ``EF_AMDGPU_MACH_R600_RV670``        0x004      ``rv670``
1165      ``EF_AMDGPU_MACH_R600_RV710``        0x005      ``rv710``
1166      ``EF_AMDGPU_MACH_R600_RV730``        0x006      ``rv730``
1167      ``EF_AMDGPU_MACH_R600_RV770``        0x007      ``rv770``
1168      ``EF_AMDGPU_MACH_R600_CEDAR``        0x008      ``cedar``
1169      ``EF_AMDGPU_MACH_R600_CYPRESS``      0x009      ``cypress``
1170      ``EF_AMDGPU_MACH_R600_JUNIPER``      0x00a      ``juniper``
1171      ``EF_AMDGPU_MACH_R600_REDWOOD``      0x00b      ``redwood``
1172      ``EF_AMDGPU_MACH_R600_SUMO``         0x00c      ``sumo``
1173      ``EF_AMDGPU_MACH_R600_BARTS``        0x00d      ``barts``
1174      ``EF_AMDGPU_MACH_R600_CAICOS``       0x00e      ``caicos``
1175      ``EF_AMDGPU_MACH_R600_CAYMAN``       0x00f      ``cayman``
1176      ``EF_AMDGPU_MACH_R600_TURKS``        0x010      ``turks``
1177      *reserved*                           0x011 -    Reserved for ``r600``
1178                                           0x01f      architecture processors.
1179      ``EF_AMDGPU_MACH_AMDGCN_GFX600``     0x020      ``gfx600``
1180      ``EF_AMDGPU_MACH_AMDGCN_GFX601``     0x021      ``gfx601``
1181      ``EF_AMDGPU_MACH_AMDGCN_GFX700``     0x022      ``gfx700``
1182      ``EF_AMDGPU_MACH_AMDGCN_GFX701``     0x023      ``gfx701``
1183      ``EF_AMDGPU_MACH_AMDGCN_GFX702``     0x024      ``gfx702``
1184      ``EF_AMDGPU_MACH_AMDGCN_GFX703``     0x025      ``gfx703``
1185      ``EF_AMDGPU_MACH_AMDGCN_GFX704``     0x026      ``gfx704``
1186      *reserved*                           0x027      Reserved.
1187      ``EF_AMDGPU_MACH_AMDGCN_GFX801``     0x028      ``gfx801``
1188      ``EF_AMDGPU_MACH_AMDGCN_GFX802``     0x029      ``gfx802``
1189      ``EF_AMDGPU_MACH_AMDGCN_GFX803``     0x02a      ``gfx803``
1190      ``EF_AMDGPU_MACH_AMDGCN_GFX810``     0x02b      ``gfx810``
1191      ``EF_AMDGPU_MACH_AMDGCN_GFX900``     0x02c      ``gfx900``
1192      ``EF_AMDGPU_MACH_AMDGCN_GFX902``     0x02d      ``gfx902``
1193      ``EF_AMDGPU_MACH_AMDGCN_GFX904``     0x02e      ``gfx904``
1194      ``EF_AMDGPU_MACH_AMDGCN_GFX906``     0x02f      ``gfx906``
1195      ``EF_AMDGPU_MACH_AMDGCN_GFX908``     0x030      ``gfx908``
1196      ``EF_AMDGPU_MACH_AMDGCN_GFX909``     0x031      ``gfx909``
1197      ``EF_AMDGPU_MACH_AMDGCN_GFX90C``     0x032      ``gfx90c``
1198      ``EF_AMDGPU_MACH_AMDGCN_GFX1010``    0x033      ``gfx1010``
1199      ``EF_AMDGPU_MACH_AMDGCN_GFX1011``    0x034      ``gfx1011``
1200      ``EF_AMDGPU_MACH_AMDGCN_GFX1012``    0x035      ``gfx1012``
1201      ``EF_AMDGPU_MACH_AMDGCN_GFX1030``    0x036      ``gfx1030``
1202      ``EF_AMDGPU_MACH_AMDGCN_GFX1031``    0x037      ``gfx1031``
1203      ``EF_AMDGPU_MACH_AMDGCN_GFX1032``    0x038      ``gfx1032``
1204      ``EF_AMDGPU_MACH_AMDGCN_GFX1033``    0x039      ``gfx1033``
1205      ``EF_AMDGPU_MACH_AMDGCN_GFX602``     0x03a      ``gfx602``
1206      ``EF_AMDGPU_MACH_AMDGCN_GFX705``     0x03b      ``gfx705``
1207      ``EF_AMDGPU_MACH_AMDGCN_GFX805``     0x03c      ``gfx805``
1208      ``EF_AMDGPU_MACH_AMDGCN_GFX1035``    0x03d      ``gfx1035``
1209      ``EF_AMDGPU_MACH_AMDGCN_GFX1034``    0x03e      ``gfx1034``
1210      ``EF_AMDGPU_MACH_AMDGCN_GFX90A``     0x03f      ``gfx90a``
1211      *reserved*                           0x040      Reserved.
1212      *reserved*                           0x041      Reserved.
1213      ``EF_AMDGPU_MACH_AMDGCN_GFX1013``    0x042      ``gfx1013``
1214      *reserved*                           0x043      Reserved.
1215      *reserved*                           0x044      Reserved.
1216      *reserved*                           0x045      Reserved.
1217      ==================================== ========== =============================
1218
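The code object V4 ``e_flags`` layout can be decoded with plain masking and
shifting. The sketch below is illustrative only: the type and function names
(``amdgpu_feature_setting``, ``amdgpu_eflags_v4``,
``amdgpu_decode_eflags_v4``) are hypothetical, and the shift amounts 8 and 10
are derived from the mask values in the V4 table rather than taken from any
LLVM header.

.. code:: c

  #include <stdint.h>

  /* Masks from the code object V4 e_flags table. */
  enum {
    EF_AMDGPU_MACH = 0x0ff,
    EF_AMDGPU_FEATURE_XNACK_V4 = 0x300,
    EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xc00
  };

  /* Both feature fields use the same four-state encoding once shifted down
     to bit 0. */
  enum amdgpu_feature_setting {
    AMDGPU_FEATURE_UNSUPPORTED = 0,
    AMDGPU_FEATURE_ANY = 1,
    AMDGPU_FEATURE_OFF = 2,
    AMDGPU_FEATURE_ON = 3
  };

  struct amdgpu_eflags_v4 {
    unsigned mach; /* One of the EF_AMDGPU_MACH_xxx values. */
    enum amdgpu_feature_setting xnack;
    enum amdgpu_feature_setting sramecc;
  };

  /* Hypothetical helper: splits a V4 e_flags word into its three fields. */
  struct amdgpu_eflags_v4 amdgpu_decode_eflags_v4(uint32_t e_flags) {
    struct amdgpu_eflags_v4 result;
    result.mach = e_flags & EF_AMDGPU_MACH;
    result.xnack = (enum amdgpu_feature_setting)(
        (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) >> 8);
    result.sramecc = (enum amdgpu_feature_setting)(
        (e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4) >> 10);
    return result;
  }

For example, an ``e_flags`` value of ``0x73f`` decodes to processor ``0x03f``
(``gfx90a``), XNACK enabled (``0x300``), and SRAMECC able to have any value
(``0x400``).
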
1219 Sections
1220 --------
1221
1222 An AMDGPU target ELF code object has the standard ELF sections which include:
1223
1224   .. table:: AMDGPU ELF Sections
1225      :name: amdgpu-elf-sections-table
1226
1227      ================== ================ =================================
1228      Name               Type             Attributes
1229      ================== ================ =================================
1230      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
1231      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1232      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
1233      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
1234      ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
1235      ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
1236      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1237      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
1238      ``.note``          ``SHT_NOTE``     *none*
1239      ``.rela``\ *name*  ``SHT_RELA``     *none*
1240      ``.rela.dyn``      ``SHT_RELA``     *none*
1241      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
1242      ``.shstrtab``      ``SHT_STRTAB``   *none*
1243      ``.strtab``        ``SHT_STRTAB``   *none*
1244      ``.symtab``        ``SHT_SYMTAB``   *none*
1245      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1246      ================== ================ =================================
1247
1248 These sections have their standard meanings (see [ELF]_) and are only generated
1249 if needed.
1250
1251 ``.debug_``\ *\**
1252   The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1253   information on the DWARF produced by the AMDGPU backend.
1254
1255 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1256   The standard sections used by a dynamic loader.
1257
1258 ``.note``
1259   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1260   backend.
1261
1262 ``.rela``\ *name*, ``.rela.dyn``
1263   For relocatable code objects, *name* is the name of the section to which the
1264   relocation records apply. For example, ``.rela.text`` is the section name for
1265   relocation records associated with the ``.text`` section.
1266
1267   For linked shared code objects, ``.rela.dyn`` contains all the relocation
1268   records from each of the relocatable code object's ``.rela``\ *name* sections.
1269
1270   See :ref:`amdgpu-relocation-records` for the relocation records supported by
1271   the AMDGPU backend.
1272
1273 ``.text``
1274   The executable machine code for the kernels and functions they call. Generated
1275   as position independent code. See :ref:`amdgpu-code-conventions` for
1276   information on conventions used in the ISA generation.
1277
1278 .. _amdgpu-note-records:
1279
1280 Note Records
1281 ------------
1282
1283 The AMDGPU backend code object contains ELF note records in the ``.note``
1284 section. The set of generated notes and their semantics depend on the code
1285 object version; see :ref:`amdgpu-note-records-v2` and
1286 :ref:`amdgpu-note-records-v3-v4`.
1287
1288 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1289 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1290 byte aligned. In addition, minimal zero-byte padding must be generated to
1291 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1292 field of the ``.note`` section must be at least 4 to indicate at least 4 byte
1293 alignment.
1294
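The padding rules above determine where the ``desc`` field starts and how large
a note record is. The following sketch computes both, assuming the conventional
note header of three 32-bit words (``namesz``, ``descsz``, ``type``) that
precedes the ``name`` field. The function names are hypothetical.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Size in bytes of the namesz, descsz and type words (assumed layout). */
  enum { AMDGPU_NOTE_HEADER_SIZE = 12 };

  /* Rounds n up to the next multiple of 4. */
  uint32_t amdgpu_note_align4(uint32_t n) { return (n + 3u) & ~3u; }

  /* Offset of the desc field from the start of the note record. namesz
     includes the terminating NUL of the name. */
  size_t amdgpu_note_desc_offset(uint32_t namesz) {
    return (size_t)AMDGPU_NOTE_HEADER_SIZE + amdgpu_note_align4(namesz);
  }

  /* Total size of the note record including both runs of zero padding. */
  size_t amdgpu_note_record_size(uint32_t namesz, uint32_t descsz) {
    return amdgpu_note_desc_offset(namesz) + amdgpu_note_align4(descsz);
  }

With the vendor name "AMD" (``namesz`` of 4) no padding is needed and ``desc``
starts at offset 16. With "AMDGPU" (``namesz`` of 7) one byte of padding is
added and ``desc`` starts at offset 20.
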
1295 .. _amdgpu-note-records-v2:
1296
1297 Code Object V2 Note Records
1298 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1299
1300 .. warning::
1301   Code object V2 is not the default code object version emitted by
1302   this version of LLVM.
1303
1304 The AMDGPU backend code object uses the following ELF note records in the
1305 ``.note`` section when compiling for code object V2.
1306
1307 The note record vendor field is "AMD".
1308
1309 Additional note records may be present, but any which are not documented here
1310 are deprecated and should not be used.
1311
1312   .. table:: AMDGPU Code Object V2 ELF Note Records
1313      :name: amdgpu-elf-note-records-v2-table
1314
1315      ===== ===================================== ======================================
1316      Name  Type                                  Description
1317      ===== ===================================== ======================================
1318      "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
1319      "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
1320                                                  Finalizer and not the LLVM compiler.
1321      "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
1322      "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
1323                                                  YAML [YAML]_ textual format.
1324      "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
1325      ===== ===================================== ======================================
1326
1327 ..
1328
1329   .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1330      :name: amdgpu-elf-note-record-enumeration-values-v2-table
1331
1332      ===================================== =====
1333      Name                                  Value
1334      ===================================== =====
1335      ``NT_AMD_HSA_CODE_OBJECT_VERSION``    1
1336      ``NT_AMD_HSA_HSAIL``                  2
1337      ``NT_AMD_HSA_ISA_VERSION``            3
1338      *reserved*                            4-9
1339      ``NT_AMD_HSA_METADATA``               10
1340      ``NT_AMD_HSA_ISA_NAME``               11
1341      ===================================== =====
1342
1343 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1344   Specifies the code object version number. The description field has the
1345   following layout:
1346
1347   .. code:: c
1348
1349     struct amdgpu_hsa_note_code_object_version_s {
1350       uint32_t major_version;
1351       uint32_t minor_version;
1352     };
1353
1354   The ``major_version`` has a value less than or equal to 2.
1355
1356 ``NT_AMD_HSA_HSAIL``
1357   Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1358   field has the following layout:
1359
1360   .. code:: c
1361
1362     struct amdgpu_hsa_note_hsail_s {
1363       uint32_t hsail_major_version;
1364       uint32_t hsail_minor_version;
1365       uint8_t profile;
1366       uint8_t machine_model;
1367       uint8_t default_float_round;
1368     };
1369
1370 ``NT_AMD_HSA_ISA_VERSION``
1371   Specifies the target ISA version. The description field has the following layout:
1372
1373   .. code:: c
1374
1375     struct amdgpu_hsa_note_isa_s {
1376       uint16_t vendor_name_size;
1377       uint16_t architecture_name_size;
1378       uint32_t major;
1379       uint32_t minor;
1380       uint32_t stepping;
1381       char vendor_and_architecture_name[1];
1382     };
1383
1384   ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1385   vendor and architecture names respectively, including the NUL character.
1386
1387   ``vendor_and_architecture_name`` contains the NUL terminated string for the
1388   vendor, immediately followed by the NUL terminated string for the
1389   architecture.
1390
1391   This note record is used by the HSA runtime loader.
1392
1393   Code object V2 only supports a limited number of processors and has fixed
1394   settings for target features. See
1395   :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1396   processors and the corresponding target ID. In the table the note record ISA
1397   name is a concatenation of the vendor name, architecture name, major, minor,
1398   and stepping separated by a ":".
1399
1400   The target ID column shows the processor name and fixed target features used
1401   by the LLVM compiler. The LLVM compiler does not generate an
1402   ``NT_AMD_HSA_HSAIL`` note record.
1403
1404   A code object generated by the Finalizer also uses code object V2 and always
1405   generates an ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1406   ``sramecc`` target feature are as shown in
1407   :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1408   target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1409   bit.
1410
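As a sketch of how the ``NT_AMD_HSA_ISA_VERSION`` description field relates to
the note record ISA name, the function below reads the fields of
``amdgpu_hsa_note_isa_s`` from a raw little-endian buffer and formats them as
``vendor:architecture:major:minor:stepping``. The function name
``amdgpu_format_isa_name`` and its error convention (0 on success, -1 on a
malformed or truncated record) are hypothetical; the field offsets follow from
the struct layout shown above.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Reads an n-byte little-endian unsigned integer (n <= 4). */
  static uint32_t read_le(const uint8_t *p, size_t n) {
    uint32_t v = 0;
    for (size_t i = 0; i < n; ++i)
      v |= (uint32_t)p[i] << (8 * i);
    return v;
  }

  /* Hypothetical helper: formats the ISA name of a desc field into out. */
  int amdgpu_format_isa_name(const uint8_t *desc, size_t desc_size, char *out,
                             size_t out_size) {
    enum { FIXED_SIZE = 16 }; /* Offset of vendor_and_architecture_name. */
    if (desc_size < FIXED_SIZE)
      return -1;
    size_t vendor_size = read_le(desc, 2);
    size_t arch_size = read_le(desc + 2, 2);
    unsigned isa_major = read_le(desc + 4, 4);
    unsigned isa_minor = read_le(desc + 8, 4);
    unsigned isa_stepping = read_le(desc + 12, 4);
    const char *vendor = (const char *)desc + FIXED_SIZE;
    const char *arch = vendor + vendor_size;
    /* Both sizes include the NUL, so neither may be zero. */
    if (vendor_size == 0 || arch_size == 0 ||
        vendor_size + arch_size > desc_size - FIXED_SIZE)
      return -1;
    /* Each name must be NUL terminated exactly at its last byte. */
    if (memchr(vendor, '\0', vendor_size) != vendor + vendor_size - 1 ||
        memchr(arch, '\0', arch_size) != arch + arch_size - 1)
      return -1;
    int n = snprintf(out, out_size, "%s:%s:%u:%u:%u", vendor, arch, isa_major,
                     isa_minor, isa_stepping);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
  }

For a record with vendor "AMD", architecture "AMDGPU", and version 9.0.6, the
formatted string is ``AMD:AMDGPU:9:0:6``.
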
1411 ``NT_AMD_HSA_ISA_NAME``
1412   Specifies the target ISA name as a non-NUL terminated string.
1413
1414   This note record is not used by the HSA runtime loader.
1415
1416   See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1417   V2's limited support of processors and fixed settings for target features.
1418
1419   See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1420   from the string to the corresponding target ID. If the ``xnack`` target
1421   feature is supported and enabled, the string produced by the LLVM compiler
1422   may have ``+xnack`` appended. The Finalizer did not append it and instead
1423   used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1424
1425 ``NT_AMD_HSA_METADATA``
1426   Specifies extensible metadata associated with the code objects executed on HSA
1427   [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1428   target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1429   :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1430   metadata string.
1431
1432   .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1433      :name: amdgpu-elf-note-record-supported_processors-v2-table
1434
1435      ===================== ==========================
1436      Note Record ISA Name  Target ID
1437      ===================== ==========================
1438      ``AMD:AMDGPU:6:0:0``  ``gfx600``
1439      ``AMD:AMDGPU:6:0:1``  ``gfx601``
1440      ``AMD:AMDGPU:6:0:2``  ``gfx602``
1441      ``AMD:AMDGPU:7:0:0``  ``gfx700``
1442      ``AMD:AMDGPU:7:0:1``  ``gfx701``
1443      ``AMD:AMDGPU:7:0:2``  ``gfx702``
1444      ``AMD:AMDGPU:7:0:3``  ``gfx703``
1445      ``AMD:AMDGPU:7:0:4``  ``gfx704``
1446      ``AMD:AMDGPU:7:0:5``  ``gfx705``
1447      ``AMD:AMDGPU:8:0:0``  ``gfx802``
1448      ``AMD:AMDGPU:8:0:1``  ``gfx801:xnack+``
1449      ``AMD:AMDGPU:8:0:2``  ``gfx802``
1450      ``AMD:AMDGPU:8:0:3``  ``gfx803``
1451      ``AMD:AMDGPU:8:0:4``  ``gfx803``
1452      ``AMD:AMDGPU:8:0:5``  ``gfx805``
1453      ``AMD:AMDGPU:8:1:0``  ``gfx810:xnack+``
1454      ``AMD:AMDGPU:9:0:0``  ``gfx900:xnack-``
1455      ``AMD:AMDGPU:9:0:1``  ``gfx900:xnack+``
1456      ``AMD:AMDGPU:9:0:2``  ``gfx902:xnack-``
1457      ``AMD:AMDGPU:9:0:3``  ``gfx902:xnack+``
1458      ``AMD:AMDGPU:9:0:4``  ``gfx904:xnack-``
1459      ``AMD:AMDGPU:9:0:5``  ``gfx904:xnack+``
1460      ``AMD:AMDGPU:9:0:6``  ``gfx906:sramecc-:xnack-``
1461      ``AMD:AMDGPU:9:0:7``  ``gfx906:sramecc-:xnack+``
1462      ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1463      ===================== ==========================
1464
1465 .. _amdgpu-note-records-v3-v4:
1466
1467 Code Object V3 to V4 Note Records
1468 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1469
1470 The AMDGPU backend code object uses the following ELF note record in the
1471 ``.note`` section when compiling for code object V3 to V4.
1472
1473 The note record vendor field is "AMDGPU".
1474
1475 Additional note records may be present, but any which are not documented here
1476 are deprecated and should not be used.
1477
1478   .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
1479      :name: amdgpu-elf-note-records-table-v3-v4
1480
1481      ======== ============================== ======================================
1482      Name     Type                           Description
1483      ======== ============================== ======================================
1484      "AMDGPU" ``NT_AMDGPU_METADATA``         Metadata in Message Pack [MsgPack]_
1485                                              binary format.
1486      ======== ============================== ======================================
1487
1488 ..
1489
1490   .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
1491      :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4
1492
1493      ============================== =====
1494      Name                           Value
1495      ============================== =====
1496      *reserved*                     0-31
1497      ``NT_AMDGPU_METADATA``         32
1498      ============================== =====
1499
1500 ``NT_AMDGPU_METADATA``
1501   Specifies extensible metadata associated with an AMDGPU code object. It is
1502   encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1503   :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
1504   :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
1505   ``amdhsa`` OS.
1506
1507 .. _amdgpu-symbols:
1508
1509 Symbols
1510 -------
1511
1512 Symbols include the following:
1513
1514   .. table:: AMDGPU ELF Symbols
1515      :name: amdgpu-elf-symbols-table
1516
1517      ===================== ================== ================ ==================
1518      Name                  Type               Section          Description
1519      ===================== ================== ================ ==================
1520      *link-name*           ``STT_OBJECT``     - ``.data``      Global variable
1521                                               - ``.rodata``
1522                                               - ``.bss``
1523      *link-name*\ ``.kd``  ``STT_OBJECT``     - ``.rodata``    Kernel descriptor
1524      *link-name*           ``STT_FUNC``       - ``.text``      Kernel entry point
1525      *link-name*           ``STT_OBJECT``     - SHN_AMDGPU_LDS Global variable in LDS
1526      ===================== ================== ================ ==================
1527
1528 Global variable
1529   Global variables both used and defined by the compilation unit.
1530
1531   If the symbol is defined in the compilation unit then it is allocated in the
1532   appropriate section according to whether it has initialized data or is read-only.
1533
1534   If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1535   will resolve relocations using the definition provided by another code object
1536   or explicitly defined by the runtime.
1537
1538   If the symbol resides in local/group memory (LDS) then its section is the
1539   special processor specific section name ``SHN_AMDGPU_LDS``, and the
1540   ``st_value`` field describes alignment requirements as it does for common
1541   symbols.
1542
1543   .. TODO::
1544
1545      Add description of linked shared object symbols. Seems undefined symbols
1546      are marked as STT_NOTYPE.
1547
1548 Kernel descriptor
1549   Every HSA kernel has an associated kernel descriptor. It is the address of the
1550   kernel descriptor that is used in the AQL dispatch packet used to invoke the
1551   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1552   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1553
1554 Kernel entry point
1555   Every HSA kernel also has a symbol for its machine code entry point.
1556
1557 .. _amdgpu-relocation-records:
1558
1559 Relocation Records
1560 ------------------
1561
1562 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1563 relocatable fields are:
1564
1565 ``word32``
1566   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1567   alignment. These values use the same byte order as other word values in the
1568   AMDGPU architecture.
1569
1570 ``word64``
1571   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1572   alignment. These values use the same byte order as other word values in the
1573   AMDGPU architecture.
1574
1575 The following notation is used for specifying relocation calculations:
1576
1577 **A**
1578   Represents the addend used to compute the value of the relocatable field.
1579
1580 **G**
1581   Represents the offset into the global offset table at which the relocation
1582   entry's symbol will reside during execution.
1583
1584 **GOT**
1585   Represents the address of the global offset table.
1586
1587 **P**
1588   Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1589   of the storage unit being relocated (computed using ``r_offset``).
1590
1591 **S**
1592   Represents the value of the symbol whose index resides in the relocation
1593   entry. Relocations not using this must specify a symbol index of
1594   ``STN_UNDEF``.
1595
1596 **B**
1597   Represents the base address of a loaded executable or shared object which is
1598   the difference between the ELF address and the actual load address.
1599   Relocations using this are only valid in executable or shared objects.
1600
1601 The following relocation types are supported:
1602
1603   .. table:: AMDGPU ELF Relocation Records
1604      :name: amdgpu-elf-relocation-records-table
1605
1606      ========================== ======= =====  ==========  ==============================
1607      Relocation Type            Kind    Value  Field       Calculation
1608      ========================== ======= =====  ==========  ==============================
1609      ``R_AMDGPU_NONE``                  0      *none*      *none*
1610      ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
1611                                 Dynamic
1612      ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
1613                                 Dynamic
1614      ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
1615                                 Dynamic
1616      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
1617      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
1618      ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
1619                                 Dynamic
1620      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
1621      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
1622      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
1623      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
1624      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
1625      *reserved*                         12
1626      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
1627      ``R_AMDGPU_REL16``         Static  14     ``word16``  ((S + A - P) - 4) / 4
1628      ========================== ======= =====  ==========  ==============================
1629
1630 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1631 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1632
1633 There is no current OS loader support for 32-bit programs and so
1634 ``R_AMDGPU_ABS32`` is not used.
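
The calculations in :ref:`amdgpu-elf-relocation-records-table` can be sketched
in Python as follows. This is only an illustration, not the loader's or
linker's implementation. The function name ``reloc_value`` is hypothetical, and
two behaviors are assumptions that the table does not state: arithmetic wraps
to 64 bits, and the result is truncated to the width of the relocatable field.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def reloc_value(kind, S=0, A=0, P=0, G=0, GOT=0, B=0):
    """Return the bits stored in the relocatable field for one relocation.

    Assumption (not stated in the table): arithmetic wraps to 64 bits and the
    result is truncated to the width of the field.
    """
    abs64 = (S + A) & MASK64
    rel64 = (S + A - P) & MASK64
    got64 = (G + GOT + A - P) & MASK64
    calculations = {
        "R_AMDGPU_ABS32_LO": abs64 & MASK32,
        "R_AMDGPU_ABS32_HI": abs64 >> 32,
        "R_AMDGPU_ABS64": abs64,
        "R_AMDGPU_REL32": rel64 & MASK32,
        "R_AMDGPU_REL64": rel64,
        "R_AMDGPU_ABS32": abs64 & MASK32,
        "R_AMDGPU_GOTPCREL": got64 & MASK32,
        "R_AMDGPU_GOTPCREL32_LO": got64 & MASK32,
        "R_AMDGPU_GOTPCREL32_HI": got64 >> 32,
        "R_AMDGPU_REL32_LO": rel64 & MASK32,
        "R_AMDGPU_REL32_HI": rel64 >> 32,
        "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
        # Assumption: the division is a signed floor division.
        "R_AMDGPU_REL16": (((S + A - P) - 4) // 4) & MASK16,
    }
    return calculations[kind]
```

For example, a symbol value of ``0x123456000`` with an addend of ``0x10``
splits into a low word of ``0x23456010`` and a high word of ``0x1``, which is
how the ``_LO`` and ``_HI`` pairs cover a 64-bit value with two ``word32``
fields.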
1635
1636 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1637
1638 Loaded Code Object Path Uniform Resource Identifier (URI)
1639 ---------------------------------------------------------
1640
1641 The AMD GPU code object loader represents the path of the ELF shared object from
1642 which the code object was loaded as a textual Uniform Resource Identifier (URI).
1643 Note that the code object is the in-memory, loaded, relocated form of the ELF
1644 shared object.  Multiple code objects may be loaded at different memory
1645 addresses in the same process from the same ELF shared object.
1646
1647 The loaded code object path URI syntax is defined by the following BNF syntax:
1648
1649 .. code::
1650
1651   code_object_uri ::== file_uri | memory_uri
1652   file_uri        ::== "file://" file_path [ range_specifier ]
1653   memory_uri      ::== "memory://" process_id range_specifier
1654   range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1655   file_path       ::== URI_ENCODED_OS_FILE_PATH
1656   process_id      ::== DECIMAL_NUMBER
1657   number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1658
1659 **number**
1660   Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1661   and octal values by "0".
1662
1663 **file_path**
1664   Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1665   every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
1666   encoded as two uppercase hexadecimal digits preceded by "%".  Directories in
1667   the path are separated by "/".
1668
1669 **offset**
1670   Is a 0-based byte offset to the start of the code object.  For a file URI, it
1671   is from the start of the file specified by the ``file_path``, and if omitted
1672   defaults to 0. For a memory URI, it is the memory address and is required.
1673
1674 **size**
1675   Is the number of bytes in the code object.  For a file URI, if omitted it
1676   defaults to the size of the file.  It is required for a memory URI.
1677
1678 **process_id**
1679   Is the identity of the process owning the memory.  For Linux it is the C
1680   unsigned integral decimal literal for the process ID (PID).
1681
1682 For example:
1683
1684 .. code::
1685
1686   file:///dir1/dir2/file1
1687   file:///dir3/dir4/file2#offset=0x2000&size=3000
1688   memory://1234#offset=0x20000&size=3000
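
A consumer of these URIs could parse them along the following lines. This is a
minimal sketch rather than the loader's own code. The helper names
(``parse_number``, ``encode_file_path``, ``parse_code_object_uri``) and the
returned dictionary layout are hypothetical, and a ``size`` of ``None`` is used
here to stand for "the size of the file".

```python
import re
from urllib.parse import quote, unquote

# range_specifier: "#" or "?", then offset=<number>&size=<number>
_RANGE = re.compile(
    r"[#?]offset=(?P<offset>[0-9a-fA-FxX]+)&size=(?P<size>[0-9a-fA-FxX]+)$"
)


def parse_number(text):
    """Parse a C integral literal: 0x/0X is hex, a leading 0 is octal."""
    if text[:2] in ("0x", "0X"):
        return int(text[2:], 16)
    if len(text) > 1 and text[0] == "0":
        return int(text[1:], 8)
    return int(text, 10)


def encode_file_path(path):
    """URI-encode every character outside [a-zA-Z0-9/_.~-] as %XX."""
    return quote(path, safe="/_.~-")


def parse_code_object_uri(uri):
    """Split a code object URI into its parts (hypothetical dict layout)."""
    if uri.startswith("file://"):
        kind, rest = "file", uri[len("file://"):]
    elif uri.startswith("memory://"):
        kind, rest = "memory", uri[len("memory://"):]
    else:
        raise ValueError("not a code object URI: " + uri)

    offset = size = None
    match = _RANGE.search(rest)
    if match:
        offset = parse_number(match.group("offset"))
        size = parse_number(match.group("size"))
        rest = rest[:match.start()]

    if kind == "memory":
        if match is None:
            raise ValueError("a memory URI requires offset and size")
        return {"kind": kind, "process_id": int(rest, 10),
                "offset": offset, "size": size}
    # For a file URI an omitted offset defaults to 0.
    return {"kind": kind, "path": unquote(rest),
            "offset": offset or 0, "size": size}
```

Because ``#`` and ``?`` are outside the unencoded character set, they cannot
appear literally in an encoded ``file_path``, so searching for the range
specifier at the end of the string is unambiguous.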
1689
1690 .. _amdgpu-dwarf-debug-information:
1691
1692 DWARF Debug Information
1693 =======================
1694
1695 .. warning::
1696
1697    This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1698    is not currently fully implemented and is subject to change.
1699
1700 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1701 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
1702 object executable code and data to the source language constructs. It can be
1703 used by tools such as debuggers and profilers. It uses features defined in
1704 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1705 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1706
1707 This section defines the AMDGPU target architecture specific DWARF mappings.
1708
1709 .. _amdgpu-dwarf-register-identifier:
1710
1711 Register Identifier
1712 -------------------
1713
1714 This section defines the AMDGPU target architecture register numbers used in
1715 DWARF operation expressions (see DWARF Version 5 section 2.5 and
1716 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1717 instructions (see DWARF Version 5 section 6.4 and
1718 :ref:`amdgpu-dwarf-call-frame-information`).
1719
1720 A single code object can contain code for kernels that have different wavefront
1721 sizes. The vector registers and some scalar registers are based on the wavefront
1722 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1723 simplifies the consumer of the DWARF so that each register has a fixed size,
1724 rather than being dynamic according to the wavefront size mode. Similarly,
1725 distinct DWARF registers are defined for those registers that vary in size
1726 according to the process address size. This allows a consumer to treat a
1727 specific AMDGPU processor as a single architecture regardless of how it is
1728 configured at run time. The compiler explicitly specifies the DWARF registers
1729 that match the mode in which the code it is generating will be executed.
1730
1731 DWARF registers are encoded as numbers, which are mapped to architecture
1732 registers. The mapping for AMDGPU is defined in
1733 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1734 mapping.
1735
1736 .. table:: AMDGPU DWARF Register Mapping
1737    :name: amdgpu-dwarf-register-mapping-table
1738
1739    ============== ================= ======== ==================================
1740    DWARF Register AMDGPU Register   Bit Size Description
1741    ============== ================= ======== ==================================
1742    0              PC_32             32       Program Counter (PC) when
1743                                              executing in a 32-bit process
1744                                              address space. Used in the CFI to
1745                                              describe the PC of the calling
1746                                              frame.
1747    1              EXEC_MASK_32      32       Execution Mask Register when
1748                                              executing in wavefront 32 mode.
1749    2-15           *Reserved*                 *Reserved for highly accessed
1750                                              registers using DWARF shortcut.*
1751    16             PC_64             64       Program Counter (PC) when
1752                                              executing in a 64-bit process
1753                                              address space. Used in the CFI to
1754                                              describe the PC of the calling
1755                                              frame.
1756    17             EXEC_MASK_64      64       Execution Mask Register when
1757                                              executing in wavefront 64 mode.
1758    18-31          *Reserved*                 *Reserved for highly accessed
1759                                              registers using DWARF shortcut.*
1760    32-95          SGPR0-SGPR63      32       Scalar General Purpose
1761                                              Registers.
1762    96-127         *Reserved*                 *Reserved for frequently accessed
1763                                              registers using DWARF 1-byte ULEB.*
1764    128            STATUS            32       Status Register.
1765    129-511        *Reserved*                 *Reserved for future Scalar
1766                                              Architectural Registers.*
1767    512            VCC_32            32       Vector Condition Code Register
1768                                              when executing in wavefront 32
1769                                              mode.
1770    513-767        *Reserved*                 *Reserved for future Vector
1771                                              Architectural Registers when
1772                                              executing in wavefront 32 mode.*
1773    768            VCC_64            64       Vector Condition Code Register
1774                                              when executing in wavefront 64
1775                                              mode.
1776    769-1023       *Reserved*                 *Reserved for future Vector
1777                                              Architectural Registers when
1778                                              executing in wavefront 64 mode.*
1779    1024-1087      *Reserved*                 *Reserved for padding.*
1780    1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
1781    1130-1535      *Reserved*                 *Reserved for future Scalar
1782                                              General Purpose Registers.*
1783    1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
1784                                              when executing in wavefront 32
1785                                              mode.
1786    1792-2047      *Reserved*                 *Reserved for future Vector
1787                                              General Purpose Registers when
1788                                              executing in wavefront 32 mode.*
1789    2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
1790                                              when executing in wavefront 32
1791                                              mode.
1792    2304-2559      *Reserved*                 *Reserved for future Vector
1793                                              Accumulation Registers when
1794                                              executing in wavefront 32 mode.*
1795    2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
1796                                              when executing in wavefront 64
1797                                              mode.
1798    2816-3071      *Reserved*                 *Reserved for future Vector
1799                                              General Purpose Registers when
1800                                              executing in wavefront 64 mode.*
1801    3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
1802                                              when executing in wavefront 64
1803                                              mode.
1804    3328-3583      *Reserved*                 *Reserved for future Vector
1805                                              Accumulation Registers when
1806                                              executing in wavefront 64 mode.*
1807    ============== ================= ======== ==================================
1808
1809 The vector registers are represented as the full size for the wavefront. They
1810 are organized as consecutive dwords (32 bits), one per lane, with the dword at
1811 the least significant bit position corresponding to lane 0 and so forth. DWARF
1812 location expressions involving the ``DW_OP_LLVM_offset`` and
1813 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1814 register corresponding to the lane that is executing the current thread of
1815 execution in languages that are implemented using a SIMD or SIMT execution
1816 model.
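
The lane layout can be illustrated with a small Python model in which a vector
register is a single integer whose least significant 32 bits hold lane 0. The
helper names ``pack_lanes`` and ``lane_dword`` are hypothetical. This models
only the bit layout described above, not the DWARF operations themselves.

```python
def pack_lanes(dwords):
    """Build a vector register value from one 32-bit dword per lane."""
    value = 0
    for lane, dword in enumerate(dwords):
        value |= (dword & 0xFFFFFFFF) << (32 * lane)
    return value


def lane_dword(vector_value, lane, wavefront_size):
    """Return the dword belonging to `lane`; lane 0 is the least significant."""
    if not 0 <= lane < wavefront_size:
        raise ValueError("lane out of range for this wavefront size")
    return (vector_value >> (32 * lane)) & 0xFFFFFFFF
```

In this model the bit offset of a lane within the register is ``32 * lane``,
which is the kind of offset a location expression would apply to select the
focused lane.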
1817
1818 If the wavefront size is 32 lanes then the wavefront 32 mode register
1819 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1820 mode register definitions are used. Some AMDGPU targets support executing in
1821 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1822 to the wavefront mode of the generated code will be used.
1823
1824 If code is generated to execute in a 32-bit process address space, then the
1825 32-bit process address space register definitions are used. If code is generated
1826 to execute in a 64-bit process address space, then the 64-bit process address
1827 space register definitions are used. The ``amdgcn`` target only supports the
1828 64-bit process address space.
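
The mapping in :ref:`amdgpu-dwarf-register-mapping-table` can be expressed as a
few lookup helpers. The following Python sketch uses hypothetical function
names. The numeric ranges are taken directly from the table.

```python
def dwarf_pc(address_bits):
    """DWARF register number of the PC for a 32- or 64-bit address space."""
    return {32: 0, 64: 16}[address_bits]


def dwarf_exec(wavefront_size):
    """DWARF register number of the execution mask for the wavefront mode."""
    return {32: 1, 64: 17}[wavefront_size]


def dwarf_vcc(wavefront_size):
    """DWARF register number of VCC for the wavefront mode."""
    return {32: 512, 64: 768}[wavefront_size]


def dwarf_sgpr(n):
    """SGPR0-SGPR63 map to 32-95; SGPR64-SGPR105 map to 1088-1129."""
    if 0 <= n <= 63:
        return 32 + n
    if 64 <= n <= 105:
        return 1088 + (n - 64)
    raise ValueError("SGPR index out of range")


def dwarf_vgpr(n, wavefront_size):
    """VGPR0-VGPR255 map to 1536-1791 (wave32) or 2560-2815 (wave64)."""
    if not 0 <= n <= 255:
        raise ValueError("VGPR index out of range")
    return {32: 1536, 64: 2560}[wavefront_size] + n


def dwarf_agpr(n, wavefront_size):
    """AGPR0-AGPR255 map to 2048-2303 (wave32) or 3072-3327 (wave64)."""
    if not 0 <= n <= 255:
        raise ValueError("AGPR index out of range")
    return {32: 2048, 64: 3072}[wavefront_size] + n
```

Note that the scalar general purpose registers are split across two ranges, so
a consumer cannot compute the DWARF number of an SGPR with a single base
offset.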
1829
1830 .. _amdgpu-dwarf-address-class-identifier:
1831
1832 Address Class Identifier
1833 ------------------------
1834
1835 The DWARF address class represents the source language memory space. See DWARF
1836 Version 5 section 2.12 which is updated by the *DWARF Extensions For
1837 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1838
1839 The DWARF address class mapping used for AMDGPU is defined in
1840 :ref:`amdgpu-dwarf-address-class-mapping-table`.
1841
1842 .. table:: AMDGPU DWARF Address Class Mapping
1843    :name: amdgpu-dwarf-address-class-mapping-table
1844
1845    ========================= ====== =================
1846    DWARF                            AMDGPU
1847    -------------------------------- -----------------
1848    Address Class Name        Value  Address Space
1849    ========================= ====== =================
1850    ``DW_ADDR_none``          0x0000 Generic (Flat)
1851    ``DW_ADDR_LLVM_global``   0x0001 Global
1852    ``DW_ADDR_LLVM_constant`` 0x0002 Global
1853    ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
1854    ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
1855    ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
1856    ========================= ====== =================
1857
1858 The DWARF address class values defined in the *DWARF Extensions For
1859 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
1860
1861 In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
1862 available to the AMD extension for accessing the hardware GDS memory, which is
1863 scratchpad memory allocated per device.
1864
1865 For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1866 address class of ``DW_ADDR_none`` is used.
1867
1868 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
1869 mapping of DWARF address classes to DWARF address spaces, including address size
1870 and NULL value.
1871
1872 .. _amdgpu-dwarf-address-space-identifier:
1873
1874 Address Space Identifier
1875 ------------------------
1876
1877 DWARF address spaces correspond to target architecture specific linear
1878 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1879 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1880
1881 The DWARF address space mapping used for AMDGPU is defined in
1882 :ref:`amdgpu-dwarf-address-space-mapping-table`.
1883
1884 .. table:: AMDGPU DWARF Address Space Mapping
1885    :name: amdgpu-dwarf-address-space-mapping-table
1886
1887    ======================================= ===== ======= ======== ================= =======================
1888    DWARF                                                          AMDGPU            Notes
1889    --------------------------------------- ----- ---------------- ----------------- -----------------------
1890    Address Space Name                      Value Address Bit Size Address Space
1891    --------------------------------------- ----- ------- -------- ----------------- -----------------------
1892    ..                                            64-bit  32-bit
1893                                                  process process
1894                                                  address address
1895                                                  space   space
1896    ======================================= ===== ======= ======== ================= =======================
1897    ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
1898    ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
1899    ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
1900    ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
1901    *Reserved*                              0x04
1902    ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
1903    ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
1904    ======================================= ===== ======= ======== ================= =======================
1905
1906 See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1907 including address size and NULL value.
1908
1909 The ``DW_ASPACE_none`` address space is the default target architecture address
1910 space used in DWARF operations that do not specify an address space. It
1911 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1912 related operations can refer to addresses in the program code.
1913
1914 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1915 specify the flat address space. If the address corresponds to an address in the
1916 local address space, then it corresponds to the wavefront that is executing the
1917 focused thread of execution. If the address corresponds to an address in the
1918 private address space, then it corresponds to the lane that is executing the
1919 focused thread of execution for languages that are implemented using a SIMD or
1920 SIMT execution model.
1921
1922 .. note::
1923
1924   CUDA-like languages such as HIP that do not have address spaces in the
1925   language type system, but do allow variables to be allocated in different
1926   address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1927   address space in the DWARF expression operations as the default address space
1928   is the global address space.
1929
1930 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
1931 specify the local address space corresponding to the wavefront that is executing
1932 the focused thread of execution.
1933
1934 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
1935 to specify the private address space corresponding to the lane that is executing
1936 the focused thread of execution for languages that are implemented using a SIMD
1937 or SIMT execution model.
1938
1939 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
1940 to specify the unswizzled private address space corresponding to the wavefront
1941 that is executing the focused thread of execution. The wavefront view of private
1942 memory is the per wavefront unswizzled backing memory layout defined in
1943 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
1944 location for the backing memory of the wavefront (namely the address is not
1945 offset by ``wavefront-scratch-base``). The following formula can be used to
1946 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
1947 ``DW_ASPACE_AMDGPU_private_wave`` address:
1948
1949 ::
1950
1951   private-address-wavefront =
1952     ((private-address-lane / 4) * wavefront-size * 4) +
1953     (wavefront-lane-id * 4) + (private-address-lane % 4)
1954
1955 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
1956 of the dwords for each lane starting with lane 0 is required, then this
1957 simplifies to:
1958
1959 ::
1960
1961   private-address-wavefront =
1962     private-address-lane * wavefront-size
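
The two formulas can be checked against each other with a short Python sketch.
The function name ``private_lane_to_wave`` is hypothetical, and integer
division is assumed for the ``/`` in the formula.

```python
def private_lane_to_wave(private_address_lane, wavefront_lane_id,
                         wavefront_size):
    """Convert a private lane address to an unswizzled private wave address."""
    return ((private_address_lane // 4) * wavefront_size * 4
            + wavefront_lane_id * 4
            + private_address_lane % 4)


# For a dword-aligned lane address and lane 0 the general formula reduces to
# private-address-lane * wavefront-size.
for address in range(0, 64, 4):
    for size in (32, 64):
        assert private_lane_to_wave(address, 0, size) == address * size
```

For instance, with a wavefront size of 32, byte 5 of lane 3 lands at
``(1 * 32 * 4) + (3 * 4) + 1 = 141`` in the wavefront view.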
1963
1964 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
1965 complete spilled vector register back into a complete vector register in the
1966 CFI. The frame pointer can be a private lane address which is dword aligned,
1967 which can be shifted to multiply by the wavefront size, and then used to form a
1968 private wavefront address that gives a location for a contiguous set of dwords,
1969 one per lane, where the vector register dwords are spilled. The compiler knows
1970 the wavefront size since it generates the code. Note that the type of the
1971 address may have to be converted as the size of a
1972 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
1973 ``DW_ASPACE_AMDGPU_private_wave`` address.
1974
1975 .. _amdgpu-dwarf-lane-identifier:
1976
1977 Lane Identifier
1978 ---------------
1979
1980 DWARF lane identifiers specify a target architecture lane position for hardware
1981 that executes in a SIMD or SIMT manner, and on which a source language maps its
1982 threads of execution onto those lanes. The DWARF lane identifier is pushed by
1983 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
1984 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
1985 section :ref:`amdgpu-dwarf-operation-expressions`.
1986
1987 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
1988 wavefront. It is numbered from 0 to the wavefront size minus 1.
1989
1990 Operation Expressions
1991 ---------------------
1992
1993 DWARF expressions are used to compute program values and the locations of
1994 program objects. See DWARF Version 5 section 2.5 and
1995 :ref:`amdgpu-dwarf-operation-expressions`.
1996
1997 DWARF location descriptions describe how to access storage which includes memory
1998 and registers. When accessing storage on AMDGPU, bytes are ordered with least
1999 significant bytes first, and bits are ordered within bytes with least
2000 significant bits first.
2001
2002 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2003 unwinding vector registers that are spilled under the execution mask to memory:
2004 the zero-single location description is the vector register, and the one-single
2005 location description is the spilled memory location description. The
2006 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2007 memory location description.
2008
2009 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2010 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2011 controlled by the execution mask. An undefined location description together
2012 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2013 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2014
2015 Debugger Information Entry Attributes
2016 -------------------------------------
2017
2018 This section describes how certain debugger information entry attributes are
2019 used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1,
2020 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2021 :ref:`amdgpu-dwarf-low-level-information` and
2022 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2023
2024 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2025
2026 ``DW_AT_LLVM_lane_pc``
2027 ~~~~~~~~~~~~~~~~~~~~~~
2028
2029 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2030 location of the separate lanes of a SIMT thread.
2031
2032 If the lane is an active lane then this will be the same as the current program
2033 location.
2034
2035 If the lane is inactive, but was active on entry to the subprogram, then this is
2036 the program location in the subprogram at which execution of the lane is
2037 conceptually positioned.
2038
2039 If the lane was not active on entry to the subprogram, then this will be the
2040 undefined location. A client debugger can check if the lane is part of a valid
2041 work-group by checking that the lane is in the range of the associated
2042 work-group within the grid, accounting for partial work-groups. If it is not,
2043 then the debugger can omit any information for the lane. Otherwise, the debugger
2044 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2045 calling subprogram until it finds a non-undefined location. Conceptually the
2046 lane only has the call frames for which it has a non-undefined
2047 ``DW_AT_LLVM_lane_pc``.
2048
2049 The following example illustrates how the AMDGPU backend can generate a DWARF
2050 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2051 following subprogram pseudo code for a target with 64 lanes per wavefront.
2052
2053 .. code::
2054   :number-lines:
2055
2056   SUBPROGRAM X
2057   BEGIN
2058     a;
2059     IF (c1) THEN
2060       b;
2061       IF (c2) THEN
2062         c;
2063       ELSE
2064         d;
2065       ENDIF
2066       e;
2067     ELSE
2068       f;
2069     ENDIF
2070     g;
2071   END
2072
2073 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2074 execution mask (``EXEC``) to linearize the control flow. The condition is
2075 evaluated to make a mask of the lanes for which the condition evaluates to true.
2076 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2077 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
2078 ``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
2079 of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
2080 region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
2081 value it had at the beginning of the region. This is shown below. Other
2082 approaches are possible, but the basic concept is the same.
2083
2084 .. code::
2085   :number-lines:
2086
2087   $lex_start:
2088     a;
2089     %1 = EXEC
2090     %2 = c1
2091   $lex_1_start:
2092     EXEC = %1 & %2
2093   $lex_1_then:
2094       b;
2095       %3 = EXEC
2096       %4 = c2
2097   $lex_1_1_start:
2098       EXEC = %3 & %4
2099   $lex_1_1_then:
2100         c;
2101       EXEC = ~EXEC & %3
2102   $lex_1_1_else:
2103         d;
2104       EXEC = %3
2105   $lex_1_1_end:
2106       e;
2107     EXEC = ~EXEC & %1
2108   $lex_1_else:
2109       f;
2110     EXEC = %1
2111   $lex_1_end:
2112     g;
2113   $lex_end:
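
The effect of the linearized control flow above can be simulated by treating
the ``EXEC`` mask and the condition masks as integers with one bit per lane.
The following Python sketch is an illustration only. The function name
``run_subprogram_x`` and the returned trace are hypothetical, and the masks
``c1`` and ``c2`` are supplied by the caller rather than computed.

```python
def run_subprogram_x(exec_mask, c1, c2, wavefront_size=64):
    """Return, for each statement a..g, the mask of lanes that execute it."""
    full = (1 << wavefront_size) - 1
    trace = {}

    EXEC = exec_mask & full
    trace["a"] = EXEC
    r1 = EXEC                # %1 = EXEC
    r2 = c1 & full           # %2 = c1
    EXEC = r1 & r2           # enter the outer THEN region
    trace["b"] = EXEC
    r3 = EXEC                # %3 = EXEC
    r4 = c2 & full           # %4 = c2
    EXEC = r3 & r4           # enter the inner THEN region
    trace["c"] = EXEC
    EXEC = ~EXEC & r3        # enter the inner ELSE region
    trace["d"] = EXEC
    EXEC = r3                # leave the inner IF/THEN/ELSE region
    trace["e"] = EXEC
    EXEC = ~EXEC & r1        # enter the outer ELSE region
    trace["f"] = EXEC
    EXEC = r1                # leave the outer IF/THEN/ELSE region
    trace["g"] = EXEC
    return trace
```

With four lanes, an entry mask of ``0b1111``, ``c1 = 0b0011`` and
``c2 = 0b0101``, lanes 0 and 1 execute ``b`` and ``e``, lane 0 alone executes
``c``, lane 1 alone executes ``d``, and lanes 2 and 3 execute ``f``.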
2114
2115 To create the DWARF location list expression that defines the location
2116 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2117 pseudo instruction can be used to annotate the linearized control flow. This can
2118 be done by defining an artificial variable for the lane PC. The DWARF location
2119 list expression created for it is used as the value of the
2120 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2121
2122 A DWARF procedure is defined for each well nested structured control flow region
2123 which provides the conceptual lane program location for a lane if it is not
2124 active (namely it is divergent). The DWARF operation expression for each region
2125 conceptually inherits the value of the immediately enclosing region and modifies
2126 it according to the semantics of the region.
2127
2128 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2129 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2130 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2131 region since the ``THEN`` region has completed.
2132
2133 The lane PC artificial variable is assigned at each region transition. It uses
2134 the immediately enclosing region's DWARF procedure to compute the program
2135 location for each lane assuming they are divergent, and then modifies the result
2136 by inserting the current program location for each lane that the ``EXEC`` mask
2137 indicates is active.
2138
2139 By having separate DWARF procedures for each region, they can be reused to
2140 define the value for any nested region. This reduces the total size of the DWARF
2141 operation expressions.
2142
2143 The following provides an example using pseudo LLVM MIR.
2144
.. code::
  :number-lines:

  $lex_start:
    DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
      DW_AT_name = "__uint64";
      DW_AT_byte_size = 8;
      DW_AT_encoding = DW_ATE_unsigned;
    ];
    DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__active_lane_pc";
      DW_AT_location = [
        DW_OP_regx PC;
        DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
        DW_OP_LLVM_select_bit_piece 64, 64;
      ];
    ];
    DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
      DW_AT_name = "__divergent_lane_pc";
      DW_AT_location = [
        DW_OP_LLVM_undefined;
        DW_OP_LLVM_extend 64, 64;
      ];
    ];
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    a;
    %1 = EXEC;
    DBG_VALUE %1, $noreg, %__lex_1_save_exec;
    %2 = c1;
  $lex_1_start:
    EXEC = %1 & %2;
  $lex_1_then:
      DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_then";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_start;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_then;
        DW_OP_call_ref %__active_lane_pc;
      ];
      b;
      %3 = EXEC;
      DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
      %4 = c2;
  $lex_1_1_start:
      EXEC = %3 & %4;
  $lex_1_1_then:
        DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_then";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_start;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_then;
          DW_OP_call_ref %__active_lane_pc;
        ];
        c;
      EXEC = ~EXEC & %3;
  $lex_1_1_else:
        DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
          DW_AT_name = "__divergent_lane_pc_1_1_else";
          DW_AT_location = DIExpression[
            DW_OP_call_ref %__divergent_lane_pc_1_then;
            DW_OP_addrx &lex_1_1_end;
            DW_OP_stack_value;
            DW_OP_LLVM_extend 64, 64;
            DW_OP_call_ref %__lex_1_1_save_exec;
            DW_OP_deref_type 64, %__uint_64;
            DW_OP_LLVM_select_bit_piece 64, 64;
          ];
        ];
        DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
          DW_OP_call_ref %__divergent_lane_pc_1_1_else;
          DW_OP_call_ref %__active_lane_pc;
        ];
        d;
      EXEC = %3;
  $lex_1_1_end:
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc;
        DW_OP_call_ref %__active_lane_pc;
      ];
      e;
    EXEC = ~EXEC & %1;
  $lex_1_else:
      DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
        DW_AT_name = "__divergent_lane_pc_1_else";
        DW_AT_location = DIExpression[
          DW_OP_call_ref %__divergent_lane_pc;
          DW_OP_addrx &lex_1_end;
          DW_OP_stack_value;
          DW_OP_LLVM_extend 64, 64;
          DW_OP_call_ref %__lex_1_save_exec;
          DW_OP_deref_type 64, %__uint_64;
          DW_OP_LLVM_select_bit_piece 64, 64;
        ];
      ];
      DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
        DW_OP_call_ref %__divergent_lane_pc_1_else;
        DW_OP_call_ref %__active_lane_pc;
      ];
      f;
    EXEC = %1;
  $lex_1_end:
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
      DW_OP_call_ref %__divergent_lane_pc;
      DW_OP_call_ref %__active_lane_pc;
    ];
    g;
  $lex_end:
2274
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.
2277
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
2283
2284 The DWARF procedures for each region use the values of the saved execution mask
2285 artificial variables to only update the lanes that are active on entry to the
2286 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
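
The way each region procedure builds on its enclosing region can be sketched
in Python. This is a simplified model under stated assumptions, not the DWARF
evaluation itself: lane PC vectors are plain lists, ``None`` stands in for the
undefined location description, and the ``select`` helper only approximates
the effect of selecting per-lane pieces by a mask. The function names mirror
the DWARF procedures in the example above.

.. code:: python

  UNDEFINED = None  # stands in for the undefined location description

  def select(base, new_pc, mask):
      """Replace entries of base with new_pc for lanes whose mask bit is set."""
      return [new_pc if (mask >> i) & 1 else old for i, old in enumerate(base)]

  def divergent_lane_pc(lanes):
      # Outermost procedure: every lane starts out undefined.
      return [UNDEFINED] * lanes

  def divergent_lane_pc_1_then(lanes, lex_1_start, lex_1_save_exec):
      # Lanes active on entry to region 1 are placed at the start of the region.
      return select(divergent_lane_pc(lanes), lex_1_start, lex_1_save_exec)

  def divergent_lane_pc_1_1_else(lanes, lex_1_start, lex_1_save_exec,
                                 lex_1_1_end, lex_1_1_save_exec):
      # Lanes active on entry to region 1.1 are placed at the end of that region.
      outer = divergent_lane_pc_1_then(lanes, lex_1_start, lex_1_save_exec)
      return select(outer, lex_1_1_end, lex_1_1_save_exec)

  # Lane 3 was never active; lanes 0 and 1 entered the nested region.
  result = divergent_lane_pc_1_1_else(4, 0x10, 0b0111, 0x40, 0b0011)

In this toy run lanes 0 and 1 report the end of the nested region, lane 2
retains the value of the enclosing region, and lane 3 stays undefined.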
2289
2290 Other structured control flow regions can be handled similarly. For example,
2291 loops would set the divergent program location for the region at the end of the
2292 loop. Any lanes active will be in the loop, and any lanes not active must have
2293 exited the loop.
2294
2295 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2296 ``IF/THEN/ELSE`` regions.
2297
2298 The DWARF procedures can use the active lane artificial variable described in
2299 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2300 ``EXEC`` mask in order to support whole or quad wavefront mode.
2301
2302 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2303
2304 ``DW_AT_LLVM_active_lane``
2305 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2306
2307 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2308 entry is used to specify the lanes that are conceptually active for a SIMT
2309 thread.
2310
2311 The execution mask may be modified to implement whole or quad wavefront mode
2312 operations. For example, all lanes may need to temporarily be made active to
2313 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2314 update it to enable the necessary lanes, perform the operations, and then
2315 restore the ``EXEC`` mask from the saved value. While executing the whole
2316 wavefront region, the conceptual execution mask is the saved value, not the
2317 ``EXEC`` value.
2318
2319 This is handled by defining an artificial variable for the active lane mask. The
2320 active lane mask artificial variable would be the actual ``EXEC`` mask for
2321 normal regions, and the saved execution mask for regions where the mask is
2322 temporarily updated. The location list expression created for this artificial
2323 variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2324 attribute.
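
A minimal Python sketch of this rule follows. The ``active_lane_mask`` helper
and the four-lane masks are hypothetical and exist only to illustrate which
value the artificial variable would describe in each kind of region.

.. code:: python

  def active_lane_mask(exec_mask, saved_exec=None):
      """Conceptual active lanes: the saved mask while EXEC is overridden."""
      return exec_mask if saved_exec is None else saved_exec

  exec_mask = 0b0011
  normal = active_lane_mask(exec_mask)

  # Whole wavefront region: EXEC is saved, then all four toy lanes are enabled.
  saved = exec_mask
  exec_mask = 0b1111
  in_whole_wavefront = active_lane_mask(exec_mask, saved_exec=saved)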
2325
2326 ``DW_AT_LLVM_augmentation``
2327 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2328
2329 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2330 debugger information entry has the following value for the augmentation string:
2331
2332 ::
2333
2334   [amdgpu:v0.0]
2335
The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of the compilation unit. The version number
conforms to [SEMVER]_.
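
As an illustration, a consumer might extract the version with a sketch like
the following. The ``parse_augmentation`` helper is hypothetical, and the
regular expression assumes a lowercase vendor name and decimal version
numbers, which the text above does not formally specify.

.. code:: python

  import re

  # Assumed entry shape: "[" vendor ":v" major "." minor "]"
  _AUGMENTATION = re.compile(r"\[([a-z]+):v(\d+)\.(\d+)\]")

  def parse_augmentation(text):
      """Return a (vendor, major, minor) tuple for each entry in the string."""
      return [
          (vendor, int(major), int(minor))
          for vendor, major, minor in _AUGMENTATION.findall(text)
      ]

  entries = parse_augmentation("[amdgpu:v0.0]")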
2339
2340 Call Frame Information
2341 ----------------------
2342
2343 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2344 *unwind* call frames in a running process or core dump. See DWARF Version 5
2345 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2346
2347 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2348
2349 1.  ``augmentation`` string contains the following null-terminated UTF-8 string:
2350
2351     ::
2352
2353       [amd:v0.0]
2354
    The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
    extensions used in this CIE or in the FDEs that use it. The version number
    conforms to [SEMVER]_.
2358
2359 2.  ``address_size`` for the ``Global`` address space is defined in
2360     :ref:`amdgpu-dwarf-address-space-identifier`.
2361
2362 3.  ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2363
2364 4.  ``code_alignment_factor`` is 4 bytes.
2365
2366     .. TODO::
2367
2368        Add to :ref:`amdgpu-processor-table` table.
2369
2370 5.  ``data_alignment_factor`` is 4 bytes.
2371
2372     .. TODO::
2373
2374        Add to :ref:`amdgpu-processor-table` table.
2375
2376 6.  ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2377     for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2378
2379 7.  ``initial_instructions`` Since a subprogram X with fewer registers can be
2380     called from subprogram Y that has more allocated, X will not change any of
2381     the extra registers as it cannot access them. Therefore, the default rule
2382     for all columns is ``same value``.
2383
2384 For AMDGPU the register number follows the numbering defined in
2385 :ref:`amdgpu-dwarf-register-identifier`.
2386
2387 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2388 the return address to get the address of a byte within the call site
2389 instructions. See DWARF Version 5 section 6.4.4.
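
The following Python sketch shows why subtracting 1 is sufficient. It assumes
the consumer already has a sorted list of instruction start addresses (for
example from a disassembler). The ``call_site_instruction`` helper and the
addresses are hypothetical.

.. code:: python

  import bisect

  def call_site_instruction(instruction_starts, return_address):
      """Return the start of the instruction containing return_address - 1."""
      index = bisect.bisect_right(instruction_starts, return_address - 1) - 1
      if index < 0:
          raise ValueError("address precedes the first instruction")
      return instruction_starts[index]

  # Toy layout: a 4-byte instruction, an 8-byte call at 0x104, then the
  # instruction that the call returns to at 0x10C.
  starts = [0x100, 0x104, 0x10C]
  call_start = call_site_instruction(starts, 0x10C)

Because the byte at the return address minus 1 always lies inside the call
instruction, the lookup does not need to know the size of that instruction.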
2390
2391 Accelerated Access
2392 ------------------
2393
2394 See DWARF Version 5 section 6.1.
2395
2396 Lookup By Name Section Header
2397 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2398
2399 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2400
For AMDGPU the lookup by name section header table has the following values:
2402
2403 ``augmentation_string_size`` (uword)
2404
2405   Set to the length of the ``augmentation_string`` value which is always a
2406   multiple of 4.
2407
2408 ``augmentation_string`` (sequence of UTF-8 characters)
2409
2410   Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2411
2412   ::
2413
2414     [amdgpu:v0.0]
2415
  The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.
2419
2420   .. note::
2421
2422     This is different to the DWARF Version 5 definition that requires the first
2423     4 characters to be the vendor ID. But this is consistent with the other
2424     augmentation strings and does allow multiple vendor contributions. However,
2425     backwards compatibility may be more desirable.
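
The padding rule for ``augmentation_string`` can be sketched in Python as
follows. The ``pad_augmentation`` helper is a hypothetical name used only for
this illustration.

.. code:: python

  def pad_augmentation(text):
      """Encode text as UTF-8 and null pad it to a multiple of 4 bytes."""
      data = text.encode("utf-8")
      return data + b"\0" * (-len(data) % 4)

  augmentation_string = pad_augmentation("[amdgpu:v0.0]")
  augmentation_string_size = len(augmentation_string)

The 13-byte string is padded with three null bytes, so
``augmentation_string_size`` is 16.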
2426
2427 Lookup By Address Section Header
2428 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2429
2430 See DWARF Version 5 section 6.1.2.
2431
For AMDGPU the lookup by address section header table has the following values:
2433
2434 ``address_size`` (ubyte)
2435
2436   Match the address size for the ``Global`` address space defined in
2437   :ref:`amdgpu-dwarf-address-space-identifier`.
2438
2439 ``segment_selector_size`` (ubyte)
2440
2441   AMDGPU does not use a segment selector so this is 0. The entries in the
2442   ``.debug_aranges`` do not have a segment selector.
2443
2444 Line Number Information
2445 -----------------------
2446
2447 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
2448
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2450 The instruction set must be obtained from the ELF file header ``e_flags`` field
2451 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2452 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
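
A consumer could recover the machine field with a sketch such as the one
below. The mask value ``0x0ff`` is an assumption made for this illustration and
should be checked against the ELF header section. The ``amdgpu_mach`` helper
and the sample ``e_flags`` value are hypothetical.

.. code:: python

  EF_AMDGPU_MACH = 0x0FF  # assumed mask; confirm against the ELF header section

  def amdgpu_mach(e_flags):
      """Return the machine field stored in the EF_AMDGPU_MACH bits."""
      return e_flags & EF_AMDGPU_MACH

  mach = amdgpu_mach(0x12C)  # hypothetical e_flags with other bits set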
2453
2454 .. TODO::
2455
2456   Should the ``isa`` state machine register be used to indicate if the code is
2457   in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2458
2459 For AMDGPU the line number program header fields have the following values (see
2460 DWARF Version 5 section 6.2.4):
2461
2462 ``address_size`` (ubyte)
2463   Matches the address size for the ``Global`` address space defined in
2464   :ref:`amdgpu-dwarf-address-space-identifier`.
2465
2466 ``segment_selector_size`` (ubyte)
2467   AMDGPU does not use a segment selector so this is 0.
2468
2469 ``minimum_instruction_length`` (ubyte)
2470   For GFX9-GFX10 this is 4.
2471
2472 ``maximum_operations_per_instruction`` (ubyte)
2473   For GFX9-GFX10 this is 1.
2474
2475 Source text for online-compiled programs (for example, those compiled by the
2476 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2477 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2478 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2479 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2480
2481 The Clang option used to control source embedding in AMDGPU is defined in
2482 :ref:`amdgpu-clang-debug-options-table`.
2483
2484   .. table:: AMDGPU Clang Debug Options
2485      :name: amdgpu-clang-debug-options-table
2486
2487      ==================== ==================================================
2488      Debug Flag           Description
2489      ==================== ==================================================
2490      -g[no-]embed-source  Enable/disable embedding source text in DWARF
2491                           debug sections. Useful for environments where
2492                           source cannot be written to disk, such as
2493                           when performing online compilation.
2494      ==================== ==================================================
2495
2496 For example:
2497
2498 ``-gembed-source``
2499   Enable the embedded source.
2500
2501 ``-gno-embed-source``
2502   Disable the embedded source.
2503
2504 32-Bit and 64-Bit DWARF Formats
2505 -------------------------------
2506
2507 See DWARF Version 5 section 7.4 and
2508 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2509
2510 For AMDGPU:
2511
2512 * For the ``amdgcn`` target architecture only the 64-bit process address space
2513   is supported.
2514
2515 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2516   the 32-bit DWARF format.
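
Since a producer may choose either format, a consumer has to inspect the
initial length field of each unit. The Python sketch below follows the DWARF
Version 5 section 7.4 encoding, in which an initial 32-bit value of
``0xffffffff`` introduces a 64-bit length. The ``dwarf_format`` helper is
hypothetical, and little-endian byte order is assumed.

.. code:: python

  import struct

  def dwarf_format(unit_header):
      """Return (format_bits, unit_length) from a unit's initial length."""
      (first,) = struct.unpack_from("<I", unit_header, 0)
      if first == 0xFFFFFFFF:
          (length,) = struct.unpack_from("<Q", unit_header, 4)
          return 64, length
      if first >= 0xFFFFFFF0:
          raise ValueError("reserved initial length value")
      return 32, first

  fmt32 = dwarf_format(struct.pack("<I", 0x40))
  fmt64 = dwarf_format(struct.pack("<IQ", 0xFFFFFFFF, 0x80))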
2517
2518 Unit Headers
2519 ------------
2520
2521 For AMDGPU the following values apply for each of the unit headers described in
2522 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2523
2524 ``address_size`` (ubyte)
2525   Matches the address size for the ``Global`` address space defined in
2526   :ref:`amdgpu-dwarf-address-space-identifier`.
2527
2528 .. _amdgpu-code-conventions:
2529
2530 Code Conventions
2531 ================
2532
2533 This section provides code conventions used for each supported target triple OS
2534 (see :ref:`amdgpu-target-triples`).
2535
2536 AMDHSA
2537 ------
2538
2539 This section provides code conventions used when the target triple OS is
2540 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2541
2542 .. _amdgpu-amdhsa-code-object-metadata:
2543
2544 Code Object Metadata
2545 ~~~~~~~~~~~~~~~~~~~~
2546
2547 The code object metadata specifies extensible metadata associated with the code
2548 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
encoding and semantics of this metadata depend on the code object version; see
2550 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2551 :ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
2552 :ref:`amdgpu-amdhsa-code-object-metadata-v4`.
2553
Code object metadata is specified in a note record (see
:ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries,
such as the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included. For
example, the AMD OpenCL runtime records kernel argument information.
2561
2562 .. _amdgpu-amdhsa-code-object-metadata-v2:
2563
2564 Code Object V2 Metadata
2565 +++++++++++++++++++++++
2566
2567 .. warning::
2568   Code object V2 is not the default code object version emitted by this version
2569   of LLVM.
2570
2571 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2572 (see :ref:`amdgpu-note-records-v2`).
2573
2574 The metadata is specified as a YAML formatted string (see [YAML]_ and
2575 :doc:`YamlIO`).
2576
2577 .. TODO::
2578
2579   Is the string null terminated? It probably should not if YAML allows it to
2580   contain null characters, otherwise it should be.
2581
2582 The metadata is represented as a single YAML document comprised of the mapping
2583 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2584 referenced tables.
2585
2586 For boolean values, the string values of ``false`` and ``true`` are used for
2587 false and true respectively.
2588
2589 Additional information can be added to the mappings. To avoid conflicts, any
2590 non-AMD key names should be prefixed by "*vendor-name*.".
2591
2592   .. table:: AMDHSA Code Object V2 Metadata Map
2593      :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2594
2595      ========== ============== ========= =======================================
2596      String Key Value Type     Required? Description
2597      ========== ============== ========= =======================================
2598      "Version"  sequence of    Required  - The first integer is the major
2599                 2 integers                 version. Currently 1.
2600                                          - The second integer is the minor
2601                                            version. Currently 0.
2602      "Printf"   sequence of              Each string is encoded information
2603                 strings                  about a printf function call. The
2604                                          encoded information is organized as
2605                                          fields separated by colon (':'):
2606
2607                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2608
2609                                          where:
2610
2611                                          ``ID``
2612                                            A 32-bit integer as a unique id for
2613                                            each printf function call
2614
2615                                          ``N``
2616                                            A 32-bit integer equal to the number
2617                                            of arguments of printf function call
2618                                            minus 1
2619
2620                                          ``S[i]`` (where i = 0, 1, ... , N-1)
2621                                            32-bit integers for the size in bytes
2622                                            of the i-th FormatString argument of
2623                                            the printf function call
2624
                                         ``FormatString``
2626                                            The format string passed to the
2627                                            printf function call.
2628      "Kernels"  sequence of    Required  Sequence of the mappings for each
2629                 mapping                  kernel in the code object. See
2630                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2631                                          for the definition of the mapping.
2632      ========== ============== ========= =======================================
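
A "Printf" entry can be decoded with a short Python sketch like the one below.
The ``parse_printf_entry`` helper and the sample entry are hypothetical. The
sketch assumes that only the format string may contain additional colons, so
it splits off exactly ``N`` size fields and treats the remainder as the format
string.

.. code:: python

  def parse_printf_entry(entry):
      """Split an ID:N:S[0]:...:S[N-1]:FormatString entry into its fields."""
      printf_id, count, rest = entry.split(":", 2)
      n = int(count)
      # The format string may itself contain colons, so split at most n times.
      *sizes, format_string = rest.split(":", n)
      if len(sizes) != n:
          raise ValueError("fewer argument sizes than N")
      return int(printf_id), [int(size) for size in sizes], format_string

  parsed = parse_printf_entry("1:2:4:8:x=%d: y=%ld")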
2633
2634 ..
2635
2636   .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2637      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2638
2639      ================= ============== ========= ================================
2640      String Key        Value Type     Required? Description
2641      ================= ============== ========= ================================
2642      "Name"            string         Required  Source name of the kernel.
2643      "SymbolName"      string         Required  Name of the kernel
2644                                                 descriptor ELF symbol.
2645      "Language"        string                   Source language of the kernel.
2646                                                 Values include:
2647
2648                                                 - "OpenCL C"
2649                                                 - "OpenCL C++"
2650                                                 - "HCC"
2651                                                 - "OpenMP"
2652
2653      "LanguageVersion" sequence of              - The first integer is the major
2654                        2 integers                 version.
2655                                                 - The second integer is the
2656                                                   minor version.
2657      "Attrs"           mapping                  Mapping of kernel attributes.
2658                                                 See
2659                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2660                                                 for the mapping definition.
2661      "Args"            sequence of              Sequence of mappings of the
2662                        mapping                  kernel arguments. See
2663                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2664                                                 for the definition of the mapping.
2665      "CodeProps"       mapping                  Mapping of properties related to
2666                                                 the kernel code. See
2667                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2668                                                 for the mapping definition.
2669      ================= ============== ========= ================================
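
The "Required" columns of the two tables above can be checked with a sketch
like the following. It assumes the YAML document has already been parsed into
Python dictionaries and lists. The ``missing_required_keys`` helper and the
kernel names in the example are hypothetical.

.. code:: python

  REQUIRED_TOP_LEVEL = ("Version", "Kernels")
  REQUIRED_PER_KERNEL = ("Name", "SymbolName")

  def missing_required_keys(metadata):
      """Return the required keys absent from a parsed V2 metadata mapping."""
      missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
      for index, kernel in enumerate(metadata.get("Kernels", [])):
          missing += [
              f"Kernels[{index}].{key}"
              for key in REQUIRED_PER_KERNEL
              if key not in kernel
          ]
      return missing

  example = {
      "Version": [1, 0],
      "Kernels": [{"Name": "foo", "SymbolName": "foo@kd"}, {"Name": "bar"}],
  }
  problems = missing_required_keys(example)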
2670
2671 ..
2672
2673   .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2674      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2675
2676      =================== ============== ========= ==============================
2677      String Key          Value Type     Required? Description
2678      =================== ============== ========= ==============================
2679      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
2680                          3 integers               must be >=1 and the dispatch
2681                                                   work-group size X, Y, Z must
2682                                                   correspond to the specified
2683                                                   values. Defaults to 0, 0, 0.
2684
2685                                                   Corresponds to the OpenCL
2686                                                   ``reqd_work_group_size``
2687                                                   attribute.
2688      "WorkGroupSizeHint" sequence of              The dispatch work-group size
2689                          3 integers               X, Y, Z is likely to be the
2690                                                   specified values.
2691
2692                                                   Corresponds to the OpenCL
2693                                                   ``work_group_size_hint``
2694                                                   attribute.
2695      "VecTypeHint"       string                   The name of a scalar or vector
2696                                                   type.
2697
2698                                                   Corresponds to the OpenCL
2699                                                   ``vec_type_hint`` attribute.
2700
2701      "RuntimeHandle"     string                   The external symbol name
2702                                                   associated with a kernel.
2703                                                   OpenCL runtime allocates a
2704                                                   global buffer for the symbol
2705                                                   and saves the kernel's address
2706                                                   to it, which is used for
2707                                                   device side enqueueing. Only
2708                                                   available for device side
2709                                                   enqueued kernels.
2710      =================== ============== ========= ==============================
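
The "ReqdWorkGroupSize" rule can be expressed as a small Python check. The
``dispatch_size_allowed`` helper is a hypothetical name used for illustration.
It is a sketch of the rule as stated in the table, not of any runtime's actual
validation code.

.. code:: python

  def dispatch_size_allowed(reqd_work_group_size, dispatch_size):
      """Check a dispatch work-group size against a ReqdWorkGroupSize value."""
      reqd = tuple(reqd_work_group_size)
      if reqd == (0, 0, 0):
          return True  # default value: no required size
      if any(value < 1 for value in reqd):
          raise ValueError("all values must be >= 1 unless all are 0")
      return tuple(dispatch_size) == reqd

  unconstrained = dispatch_size_allowed([0, 0, 0], [64, 1, 1])
  matching = dispatch_size_allowed([256, 1, 1], [256, 1, 1])
  mismatched = dispatch_size_allowed([256, 1, 1], [64, 4, 1])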
2711
2712 ..
2713
2714   .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2715      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2716
2717      ================= ============== ========= ================================
2718      String Key        Value Type     Required? Description
2719      ================= ============== ========= ================================
2720      "Name"            string                   Kernel argument name.
2721      "TypeName"        string                   Kernel argument type name.
2722      "Size"            integer        Required  Kernel argument size in bytes.
2723      "Align"           integer        Required  Kernel argument alignment in
2724                                                 bytes. Must be a power of two.
2725      "ValueKind"       string         Required  Kernel argument kind that
2726                                                 specifies how to set up the
2727                                                 corresponding argument.
2728                                                 Values include:
2729
2730                                                 "ByValue"
2731                                                   The argument is copied
2732                                                   directly into the kernarg.
2733
2734                                                 "GlobalBuffer"
2735                                                   A global address space pointer
2736                                                   to the buffer data is passed
2737                                                   in the kernarg.
2738
2739                                                 "DynamicSharedPointer"
2740                                                   A group address space pointer
2741                                                   to dynamically allocated LDS
2742                                                   is passed in the kernarg.
2743
2744                                                 "Sampler"
2745                                                   A global address space
2746                                                   pointer to a S# is passed in
2747                                                   the kernarg.
2748
2749                                                 "Image"
2750                                                   A global address space
2751                                                   pointer to a T# is passed in
2752                                                   the kernarg.
2753
2754                                                 "Pipe"
2755                                                   A global address space pointer
2756                                                   to an OpenCL pipe is passed in
2757                                                   the kernarg.
2758
2759                                                 "Queue"
2760                                                   A global address space pointer
2761                                                   to an OpenCL device enqueue
2762                                                   queue is passed in the
2763                                                   kernarg.
2764
2765                                                 "HiddenGlobalOffsetX"
2766                                                   The OpenCL grid dispatch
2767                                                   global offset for the X
2768                                                   dimension is passed in the
2769                                                   kernarg.
2770
2771                                                 "HiddenGlobalOffsetY"
2772                                                   The OpenCL grid dispatch
2773                                                   global offset for the Y
2774                                                   dimension is passed in the
2775                                                   kernarg.
2776
2777                                                 "HiddenGlobalOffsetZ"
2778                                                   The OpenCL grid dispatch
2779                                                   global offset for the Z
2780                                                   dimension is passed in the
2781                                                   kernarg.
2782
2783                                                 "HiddenNone"
2784                                                   An argument that is not used
2785                                                   by the kernel. Space needs to
2786                                                   be left for it, but it does
2787                                                   not need to be set up.
2788
2789                                                 "HiddenPrintfBuffer"
2790                                                   A global address space pointer
2791                                                   to the runtime printf buffer
                                                  is passed in the kernarg.
2793
2794                                                 "HiddenHostcallBuffer"
2795                                                   A global address space pointer
2796                                                   to the runtime hostcall buffer
                                                  is passed in the kernarg.
2798
2799                                                 "HiddenDefaultQueue"
2800                                                   A global address space pointer
2801                                                   to the OpenCL device enqueue
2802                                                   queue that should be used by
2803                                                   the kernel by default is
2804                                                   passed in the kernarg.
2805
2806                                                 "HiddenCompletionAction"
2807                                                   A global address space pointer
2808                                                   to help link enqueued kernels into
2809                                                   the ancestor tree for determining
2810                                                   when the parent kernel has finished.
2811
2812                                                 "HiddenMultiGridSyncArg"
2813                                                   A global address space pointer for
2814                                                   multi-grid synchronization is
2815                                                   passed in the kernarg.
2816
2817      "ValueType"       string                   Unused and deprecated. This should no longer
2818                                                 be emitted, but is accepted for compatibility.
2819
2820
2821      "PointeeAlign"    integer                  Alignment in bytes of pointee
2822                                                 type for pointer type kernel
2823                                                 argument. Must be a power
2824                                                 of 2. Only present if
2825                                                 "ValueKind" is
2826                                                 "DynamicSharedPointer".
2827      "AddrSpaceQual"   string                   Kernel argument address space
2828                                                 qualifier. Only present if
2829                                                 "ValueKind" is "GlobalBuffer" or
2830                                                 "DynamicSharedPointer". Values
2831                                                 are:
2832
2833                                                 - "Private"
2834                                                 - "Global"
2835                                                 - "Constant"
2836                                                 - "Local"
2837                                                 - "Generic"
2838                                                 - "Region"
2839
2840                                                 .. TODO::
2841
2842                                                    Is GlobalBuffer only Global
2843                                                    or Constant? Is
2844                                                    DynamicSharedPointer always
2845                                                    Local? Can HCC allow Generic?
2846                                                    How can Private or Region
2847                                                    ever happen?
2848
2849      "AccQual"         string                   Kernel argument access
2850                                                 qualifier. Only present if
2851                                                 "ValueKind" is "Image" or
2852                                                 "Pipe". Values
2853                                                 are:
2854
2855                                                 - "ReadOnly"
2856                                                 - "WriteOnly"
2857                                                 - "ReadWrite"
2858
2859                                                 .. TODO::
2860
2861                                                    Does this apply to
2862                                                    GlobalBuffer?
2863
2864      "ActualAccQual"   string                   The actual memory accesses
2865                                                 performed by the kernel on the
2866                                                 kernel argument. Only present if
2867                                                 "ValueKind" is "GlobalBuffer",
2868                                                 "Image", or "Pipe". This may be
2869                                                 more restrictive than indicated
2870                                                 by "AccQual" to reflect what the
2871                                                 kernel actually does. If not
2872                                                 present then the runtime must
2873                                                 assume what is implied by
2874                                                 "AccQual" and "IsConst". Values
2875                                                 are:
2876
2877                                                 - "ReadOnly"
2878                                                 - "WriteOnly"
2879                                                 - "ReadWrite"
2880
2881      "IsConst"         boolean                  Indicates if the kernel argument
2882                                                 is const qualified. Only present
2883                                                 if "ValueKind" is
2884                                                 "GlobalBuffer".
2885
2886      "IsRestrict"      boolean                  Indicates if the kernel argument
2887                                                 is restrict qualified. Only
2888                                                 present if "ValueKind" is
2889                                                 "GlobalBuffer".
2890
2891      "IsVolatile"      boolean                  Indicates if the kernel argument
2892                                                 is volatile qualified. Only
2893                                                 present if "ValueKind" is
2894                                                 "GlobalBuffer".
2895
2896      "IsPipe"          boolean                  Indicates if the kernel argument
2897                                                 is pipe qualified. Only present
2898                                                 if "ValueKind" is "Pipe".
2899
2900                                                 .. TODO::
2901
2902                                                    Can GlobalBuffer be pipe
2903                                                    qualified?
2904
2905      ================= ============== ========= ================================
2906
2907 ..
2908
2909   .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2910      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2911
2912      ============================ ============== ========= =====================
2913      String Key                   Value Type     Required? Description
2914      ============================ ============== ========= =====================
2915      "KernargSegmentSize"         integer        Required  The size in bytes of
2916                                                            the kernarg segment
2917                                                            that holds the values
2918                                                            of the arguments to
2919                                                            the kernel.
2920      "GroupSegmentFixedSize"      integer        Required  The amount of group
2921                                                            segment memory
2922                                                            required by a
2923                                                            work-group in
2924                                                            bytes. This does not
2925                                                            include any
2926                                                            dynamically allocated
2927                                                            group segment memory
2928                                                            that may be added
2929                                                            when the kernel is
2930                                                            dispatched.
2931      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
2932                                                            private address space
2933                                                            memory required for a
2934                                                            work-item in
2935                                                            bytes. If the kernel
2936                                                            uses a dynamic call
2937                                                            stack then additional
2938                                                            space must be added
2939                                                            to this value for the
2940                                                            call stack.
2941      "KernargSegmentAlign"        integer        Required  The maximum byte
2942                                                            alignment of
2943                                                            arguments in the
2944                                                            kernarg segment. Must
2945                                                            be a power of 2.
2946      "WavefrontSize"              integer        Required  Wavefront size. Must
2947                                                            be a power of 2.
2948      "NumSGPRs"                   integer        Required  Number of scalar
2949                                                            registers used by a
2950                                                            wavefront for
2951                                                            GFX6-GFX10. This
2952                                                            includes the special
2953                                                            SGPRs for VCC, Flat
2954                                                            Scratch (GFX7-GFX10)
2955                                                            and XNACK (for
2956                                                            GFX8-GFX10). It does
2957                                                            not include the 16
2958                                                            SGPRs added if a trap
2959                                                            handler is
2960                                                            enabled. It is not
2961                                                            rounded up to the
2962                                                            allocation
2963                                                            granularity.
2964      "NumVGPRs"                   integer        Required  Number of vector
2965                                                            registers used by
2966                                                            each work-item for
2967                                                            GFX6-GFX10.
2968      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
2969                                                            work-group size
2970                                                            supported by the
2971                                                            kernel in work-items.
2972                                                            Must be >=1 and
2973                                                            consistent with
2974                                                            ReqdWorkGroupSize if
2975                                                            not 0, 0, 0.
2976      "NumSpilledSGPRs"            integer                  Number of stores from
2977                                                            a scalar register to
2978                                                            a register allocator
2979                                                            created spill
2980                                                            location.
2981      "NumSpilledVGPRs"            integer                  Number of stores from
2982                                                            a vector register to
2983                                                            a register allocator
2984                                                            created spill
2985                                                            location.
2986      ============================ ============== ========= =====================
2987
2988 .. _amdgpu-amdhsa-code-object-metadata-v3:
2989
2990 Code Object V3 Metadata
2991 +++++++++++++++++++++++
2992
2993 Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
2994 record (see :ref:`amdgpu-note-records-v3-v4`).
2995
2996 The metadata is represented as Message Pack formatted binary data (see
2997 [MsgPack]_). The top level is a Message Pack map that includes the
2998 keys defined in table
2999 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3000 tables.
3001
3002 Additional information can be added to the maps. To avoid conflicts,
3003 any key names should be prefixed by "*vendor-name*." where
3004 ``vendor-name`` can be the name of the vendor and specific vendor
3005 tool that generates the information. The prefix is abbreviated to
3006 simply "." when it appears within a map that has been added by the
3007 same *vendor-name*.
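
As a rough illustration of the prefix rule, the sketch below models decoded
metadata as plain Python dictionaries. The helper name ``qualify_key`` and the
sample kernel values are hypothetical; they are not part of any LLVM or runtime
API, and a real consumer would first decode the Message Pack data.

.. code-block:: python

   def qualify_key(key, vendor_name):
       """Expand an abbreviated "." prefix to the full "vendor-name." prefix."""
       if key.startswith("."):
           return vendor_name + key
       return key

   # Hypothetical decoded top-level map. Keys inside the kernel map use the
   # abbreviated "." prefix because the enclosing map was added by "amdhsa".
   metadata = {
       "amdhsa.version": [1, 0],
       "amdhsa.kernels": [
           {".name": "my_kernel", ".symbol": "my_kernel.kd"},
       ],
   }

   full_keys = sorted(
       qualify_key(key, "amdhsa") for key in metadata["amdhsa.kernels"][0]
   )

Here ``full_keys`` holds the unabbreviated spellings of the kernel map keys.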
3008
3009   .. table:: AMDHSA Code Object V3 Metadata Map
3010      :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3011
3012      ================= ============== ========= =======================================
3013      String Key        Value Type     Required? Description
3014      ================= ============== ========= =======================================
3015      "amdhsa.version"  sequence of    Required  - The first integer is the major
3016                        2 integers                 version. Currently 1.
3017                                                 - The second integer is the minor
3018                                                   version. Currently 0.
3019      "amdhsa.printf"   sequence of              Each string is encoded information
3020                        strings                  about a printf function call. The
3021                                                 encoded information is organized as
3022                                                 fields separated by colons (':'):
3023
3024                                                 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3025
3026                                                 where:
3027
3028                                                 ``ID``
3029                                                   A 32-bit integer as a unique id for
3030                                                   each printf function call
3031
3032                                                 ``N``
3033                                                   A 32-bit integer equal to the number
3034                                                   of arguments of printf function call
3035                                                   minus 1
3036
3037                                                 ``S[i]`` (where i = 0, 1, ... , N-1)
3038                                                   32-bit integers for the size in bytes
3039                                                   of the i-th FormatString argument of
3040                                                   the printf function call
3041
3042                                                 ``FormatString``
3043                                                   The format string passed to the
3044                                                   printf function call.
3045      "amdhsa.kernels"  sequence of    Required  Sequence of the maps for each
3046                        map                      kernel in the code object. See
3047                                                 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3048                                                 for the definition of the keys included
3049                                                 in that map.
3050      ================= ============== ========= =======================================
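
The "amdhsa.printf" encoding described in the table above could be decoded
along the following lines. This is a minimal sketch, not the runtime's actual
parser: the function name ``parse_printf_entry`` is hypothetical, and it
assumes that the integer fields are written in decimal.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split one "amdhsa.printf" string into (ID, sizes, format string)."""
       printf_id, count, rest = entry.split(":", 2)
       n = int(count)
       # The format string may itself contain colons, so split at most n
       # times and keep everything after the last size field intact.
       parts = rest.split(":", n)
       if len(parts) != n + 1:
           raise ValueError("expected %d size fields" % n)
       sizes = [int(size) for size in parts[:n]]
       return int(printf_id), sizes, parts[n]

   # Hypothetical entry: ID 1, two arguments of 4 and 8 bytes.
   parsed = parse_printf_entry("1:2:4:8:x=%d y=%ld: done")

Splitting with a bounded count rather than on every colon keeps a format
string such as ``x=%d y=%ld: done`` in one piece.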
3051
3052 ..
3053
3054   .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3055      :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3056
3057      =================================== ============== ========= ================================
3058      String Key                          Value Type     Required? Description
3059      =================================== ============== ========= ================================
3060      ".name"                             string         Required  Source name of the kernel.
3061      ".symbol"                           string         Required  Name of the kernel
3062                                                                   descriptor ELF symbol.
3063      ".language"                         string                   Source language of the kernel.
3064                                                                   Values include:
3065
3066                                                                   - "OpenCL C"
3067                                                                   - "OpenCL C++"
3068                                                                   - "HCC"
3069                                                                   - "HIP"
3070                                                                   - "OpenMP"
3071                                                                   - "Assembler"
3072
3073      ".language_version"                 sequence of              - The first integer is the major
3074                                          2 integers                 version.
3075                                                                   - The second integer is the
3076                                                                     minor version.
3077      ".args"                             sequence of              Sequence of maps of the
3078                                          map                      kernel arguments. See
3079                                                                   :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3080                                                                   for the definition of the keys
3081                                                                   included in that map.
3082      ".reqd_workgroup_size"              sequence of              If not 0, 0, 0 then all values
3083                                          3 integers               must be >=1 and the dispatch
3084                                                                   work-group size X, Y, Z must
3085                                                                   correspond to the specified
3086                                                                   values. Defaults to 0, 0, 0.
3087
3088                                                                   Corresponds to the OpenCL
3089                                                                   ``reqd_work_group_size``
3090                                                                   attribute.
3091      ".workgroup_size_hint"              sequence of              The dispatch work-group size
3092                                          3 integers               X, Y, Z is likely to be the
3093                                                                   specified values.
3094
3095                                                                   Corresponds to the OpenCL
3096                                                                   ``work_group_size_hint``
3097                                                                   attribute.
3098      ".vec_type_hint"                    string                   The name of a scalar or vector
3099                                                                   type.
3100
3101                                                                   Corresponds to the OpenCL
3102                                                                   ``vec_type_hint`` attribute.
3103
3104      ".device_enqueue_symbol"            string                   The external symbol name
3105                                                                   associated with a kernel.
3106                                                                   The OpenCL runtime allocates a
3107                                                                   global buffer for the symbol
3108                                                                   and saves the kernel's address
3109                                                                   to it, which is used for
3110                                                                   device side enqueueing. Only
3111                                                                   available for device side
3112                                                                   enqueued kernels.
3113      ".kernarg_segment_size"             integer        Required  The size in bytes of
3114                                                                   the kernarg segment
3115                                                                   that holds the values
3116                                                                   of the arguments to
3117                                                                   the kernel.
3118      ".group_segment_fixed_size"         integer        Required  The amount of group
3119                                                                   segment memory
3120                                                                   required by a
3121                                                                   work-group in
3122                                                                   bytes. This does not
3123                                                                   include any
3124                                                                   dynamically allocated
3125                                                                   group segment memory
3126                                                                   that may be added
3127                                                                   when the kernel is
3128                                                                   dispatched.
3129      ".private_segment_fixed_size"       integer        Required  The amount of fixed
3130                                                                   private address space
3131                                                                   memory required for a
3132                                                                   work-item in
3133                                                                   bytes. If the kernel
3134                                                                   uses a dynamic call
3135                                                                   stack then additional
3136                                                                   space must be added
3137                                                                   to this value for the
3138                                                                   call stack.
3139      ".kernarg_segment_align"            integer        Required  The maximum byte
3140                                                                   alignment of
3141                                                                   arguments in the
3142                                                                   kernarg segment. Must
3143                                                                   be a power of 2.
3144      ".wavefront_size"                   integer        Required  Wavefront size. Must
3145                                                                   be a power of 2.
3146      ".sgpr_count"                       integer        Required  Number of scalar
3147                                                                   registers required by a
3148                                                                   wavefront for
3149                                                                   GFX6-GFX9. A register
3150                                                                   is required if it is
3151                                                                   used explicitly, or
3152                                                                   if a higher numbered
3153                                                                   register is used
3154                                                                   explicitly. This
3155                                                                   includes the special
3156                                                                   SGPRs for VCC, Flat
3157                                                                   Scratch (GFX7-GFX9)
3158                                                                   and XNACK (for
3159                                                                   GFX8-GFX9). It does
3160                                                                   not include the 16
3161                                                                   SGPRs added if a trap
3162                                                                   handler is
3163                                                                   enabled. It is not
3164                                                                   rounded up to the
3165                                                                   allocation
3166                                                                   granularity.
3167      ".vgpr_count"                       integer        Required  Number of vector
3168                                                                   registers required by
3169                                                                   each work-item for
3170                                                                   GFX6-GFX9. A register
3171                                                                   is required if it is
3172                                                                   used explicitly, or
3173                                                                   if a higher numbered
3174                                                                   register is used
3175                                                                   explicitly.
3176      ".max_flat_workgroup_size"          integer        Required  Maximum flat
3177                                                                   work-group size
3178                                                                   supported by the
3179                                                                   kernel in work-items.
3180                                                                   Must be >=1 and
3181                                                                   consistent with
3182                                                                   ".reqd_workgroup_size" if
3183                                                                   not 0, 0, 0.
3184      ".sgpr_spill_count"                 integer                  Number of stores from
3185                                                                   a scalar register to
3186                                                                   a register allocator
3187                                                                   created spill
3188                                                                   location.
3189      ".vgpr_spill_count"                 integer                  Number of stores from
3190                                                                   a vector register to
3191                                                                   a register allocator
3192                                                                   created spill
3193                                                                   location.
3194      ".kind"                             string                   The kind of the kernel
3195                                                                   with the following
3196                                                                   values:
3197
3198                                                                   "normal"
3199                                                                     Regular kernels.
3200
3201                                                                   "init"
3202                                                                     These kernels must be
3203                                                                     invoked after loading
3204                                                                     the containing code
3205                                                                     object and must
3206                                                                     complete before any
3207                                                                     normal and fini
3208                                                                     kernels in the same
3209                                                                     code object are
3210                                                                     invoked.
3211
3212                                                                   "fini"
3213                                                                     These kernels must be
3214                                                                     invoked before
3215                                                                     unloading the
3216                                                                     containing code object
3217                                                                     and after all init and
3218                                                                     normal kernels in the
3219                                                                     same code object have
3220                                                                     been invoked and
3221                                                                     completed.
3222
3223                                                                   If omitted, "normal" is
3224                                                                   assumed.
3225      =================================== ============== ========= ================================
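
The constraints in the kernel metadata table above can be checked
mechanically once the map has been decoded. The following sketch works on a
plain Python dictionary; ``check_kernel_map`` and the other names are
hypothetical, and only the rules stated in the table are covered.

.. code-block:: python

   # Keys marked "Required" in the kernel metadata map table.
   REQUIRED_KERNEL_KEYS = (
       ".name", ".symbol", ".kernarg_segment_size",
       ".group_segment_fixed_size", ".private_segment_fixed_size",
       ".kernarg_segment_align", ".wavefront_size", ".sgpr_count",
       ".vgpr_count", ".max_flat_workgroup_size",
   )

   def is_power_of_two(value):
       return value > 0 and (value & (value - 1)) == 0

   def check_kernel_map(kernel):
       """Return a list of problems found in one decoded kernel map."""
       problems = ["missing " + key
                   for key in REQUIRED_KERNEL_KEYS if key not in kernel]
       for key in (".kernarg_segment_align", ".wavefront_size"):
           if key in kernel and not is_power_of_two(kernel[key]):
               problems.append(key + " is not a power of 2")
       if kernel.get(".max_flat_workgroup_size", 1) < 1:
           problems.append(".max_flat_workgroup_size must be >= 1")
       # ".reqd_workgroup_size" defaults to 0, 0, 0 when omitted.
       reqd = list(kernel.get(".reqd_workgroup_size", [0, 0, 0]))
       if reqd != [0, 0, 0] and any(value < 1 for value in reqd):
           problems.append(".reqd_workgroup_size values must all be >= 1")
       # ".kind" defaults to "normal" when omitted.
       if kernel.get(".kind", "normal") not in ("normal", "init", "fini"):
           problems.append("unknown .kind")
       return problems

An empty result means that none of the checked rules were violated; it does
not imply that the metadata is otherwise valid.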
3226
3227 ..
3228
3229   .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3230      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3231
3232      ====================== ============== ========= ================================
3233      String Key             Value Type     Required? Description
3234      ====================== ============== ========= ================================
3235      ".name"                string                   Kernel argument name.
3236      ".type_name"           string                   Kernel argument type name.
3237      ".size"                integer        Required  Kernel argument size in bytes.
3238      ".offset"              integer        Required  Kernel argument offset in
3239                                                      bytes. The offset must be a
3240                                                      multiple of the alignment
3241                                                      required by the argument.
3242      ".value_kind"          string         Required  Kernel argument kind that
3243                                                      specifies how to set up the
3244                                                      corresponding argument.
3245                                                      Values include:
3246
3247                                                      "by_value"
3248                                                        The argument is copied
3249                                                        directly into the kernarg.
3250
3251                                                      "global_buffer"
3252                                                        A global address space pointer
3253                                                        to the buffer data is passed
3254                                                        in the kernarg.
3255
3256                                                      "dynamic_shared_pointer"
3257                                                        A group address space pointer
3258                                                        to dynamically allocated LDS
3259                                                        is passed in the kernarg.
3260
3261                                                      "sampler"
3262                                                        A global address space
3263                                                        pointer to a S# is passed in
3264                                                        the kernarg.
3265
3266                                                      "image"
3267                                                        A global address space
3268                                                        pointer to a T# is passed in
3269                                                        the kernarg.
3270
3271                                                      "pipe"
3272                                                        A global address space pointer
3273                                                        to an OpenCL pipe is passed in
3274                                                        the kernarg.
3275
3276                                                      "queue"
3277                                                        A global address space pointer
3278                                                        to an OpenCL device enqueue
3279                                                        queue is passed in the
3280                                                        kernarg.
3281
3282                                                      "hidden_global_offset_x"
3283                                                        The OpenCL grid dispatch
3284                                                        global offset for the X
3285                                                        dimension is passed in the
3286                                                        kernarg.
3287
3288                                                      "hidden_global_offset_y"
3289                                                        The OpenCL grid dispatch
3290                                                        global offset for the Y
3291                                                        dimension is passed in the
3292                                                        kernarg.
3293
3294                                                      "hidden_global_offset_z"
3295                                                        The OpenCL grid dispatch
3296                                                        global offset for the Z
3297                                                        dimension is passed in the
3298                                                        kernarg.
3299
3300                                                      "hidden_none"
3301                                                        An argument that is not used
3302                                                        by the kernel. Space needs to
3303                                                        be left for it, but it does
3304                                                        not need to be set up.
3305
3306                                                      "hidden_printf_buffer"
3307                                                        A global address space pointer
3308                                                        to the runtime printf buffer
3309                                                        is passed in the kernarg.
3310
3311                                                      "hidden_hostcall_buffer"
3312                                                        A global address space pointer
3313                                                        to the runtime hostcall buffer
3314                                                        is passed in the kernarg.
3315
3316                                                      "hidden_default_queue"
3317                                                        A global address space pointer
3318                                                        to the OpenCL device enqueue
3319                                                        queue that should be used by
3320                                                        the kernel by default is
3321                                                        passed in the kernarg.
3322
3323                                                      "hidden_completion_action"
3324                                                        A global address space pointer
3325                                                        to help link enqueued kernels into
3326                                                        the ancestor tree for determining
3327                                                        when the parent kernel has finished.
3328
3329                                                      "hidden_multigrid_sync_arg"
3330                                                        A global address space pointer for
3331                                                        multi-grid synchronization is
3332                                                        passed in the kernarg.
3333
3334      ".value_type"          string                   Unused and deprecated. This
3335                                                      should no longer be emitted, but
3336                                                      is accepted for compatibility.
3337      ".pointee_align"       integer                  Alignment in bytes of pointee
3338                                                      type for pointer type kernel
3339                                                      argument. Must be a power
3340                                                      of 2. Only present if
3341                                                      ".value_kind" is
3342                                                      "dynamic_shared_pointer".
3343      ".address_space"       string                   Kernel argument address space
3344                                                      qualifier. Only present if
3345                                                      ".value_kind" is "global_buffer" or
3346                                                      "dynamic_shared_pointer". Values
3347                                                      are:
3348
3349                                                      - "private"
3350                                                      - "global"
3351                                                      - "constant"
3352                                                      - "local"
3353                                                      - "generic"
3354                                                      - "region"
3355
3356                                                      .. TODO::
3357
3358                                                         Is "global_buffer" only "global"
3359                                                         or "constant"? Is
3360                                                         "dynamic_shared_pointer" always
3361                                                         "local"? Can HCC allow "generic"?
3362                                                         How can "private" or "region"
3363                                                         ever happen?
3364
3365      ".access"              string                   Kernel argument access
3366                                                      qualifier. Only present if
3367                                                      ".value_kind" is "image" or
3368                                                      "pipe". Values
3369                                                      are:
3370
3371                                                      - "read_only"
3372                                                      - "write_only"
3373                                                      - "read_write"
3374
3375                                                      .. TODO::
3376
3377                                                         Does this apply to
3378                                                         "global_buffer"?
3379
3380      ".actual_access"       string                   The actual memory accesses
3381                                                      performed by the kernel on the
3382                                                      kernel argument. Only present if
3383                                                      ".value_kind" is "global_buffer",
3384                                                      "image", or "pipe". This may be
3385                                                      more restrictive than indicated
3386                                                      by ".access" to reflect what the
3387                                                      kernel actually does. If not
3388                                                      present then the runtime must
3389                                                      assume what is implied by
3390                                                      ".access" and ".is_const".
3391                                                      Values are:
3392
3393                                                      - "read_only"
3394                                                      - "write_only"
3395                                                      - "read_write"
3396
3397      ".is_const"            boolean                  Indicates if the kernel argument
3398                                                      is const qualified. Only present
3399                                                      if ".value_kind" is
3400                                                      "global_buffer".
3401
3402      ".is_restrict"         boolean                  Indicates if the kernel argument
3403                                                      is restrict qualified. Only
3404                                                      present if ".value_kind" is
3405                                                      "global_buffer".
3406
3407      ".is_volatile"         boolean                  Indicates if the kernel argument
3408                                                      is volatile qualified. Only
3409                                                      present if ".value_kind" is
3410                                                      "global_buffer".
3411
3412      ".is_pipe"             boolean                  Indicates if the kernel argument
3413                                                      is pipe qualified. Only present
3414                                                      if ".value_kind" is "pipe".
3415
3416                                                      .. TODO::
3417
3418                                                         Can "global_buffer" be pipe
3419                                                         qualified?
3420
3421      ====================== ============== ========= ================================
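
As an illustration of the ``.size``, ``.offset`` and ``.value_kind`` keys, the
following Python sketch builds argument metadata maps for a hypothetical kernel
and places each argument at the next offset that is a multiple of its
alignment. The helper names, the input tuple shape and the example arguments
are assumptions made for this sketch; only the emitted keys come from the table
above.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of 2)."""
    return (value + alignment - 1) & ~(alignment - 1)


def layout_kernel_args(args):
    """Build one metadata map per argument.

    args is a list of (name, size, alignment, value_kind) tuples. This input
    shape is hypothetical; a name of None models an argument without ".name".
    """
    offset = 0
    maps = []
    for name, size, alignment, value_kind in args:
        # ".offset" must be a multiple of the alignment the argument requires.
        offset = align_up(offset, alignment)
        entry = {".size": size, ".offset": offset, ".value_kind": value_kind}
        if name is not None:
            entry[".name"] = name  # ".name" is optional
        maps.append(entry)
        offset += size
    return maps


# Hypothetical kernel: a 4-byte value, a global pointer and a hidden argument.
example = layout_kernel_args([
    ("n", 4, 4, "by_value"),
    ("data", 8, 8, "global_buffer"),
    (None, 8, 8, "hidden_global_offset_x"),
])
for entry in example:
    print(entry)
```

The 4 bytes of padding after ``n`` exist only because the pointer that follows
requires 8-byte alignment.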
3422
3423 .. _amdgpu-amdhsa-code-object-metadata-v4:
3424
3425 Code Object V4 Metadata
3426 +++++++++++++++++++++++
3427
3428 .. warning::
3429   Code object V4 is not the default code object version emitted by this version
3430   of LLVM.
3431
3432 Code object V4 metadata is the same as
3433 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3434 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3435
3436   .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
3437      :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3438
3439      ================= ============== ========= =======================================
3440      String Key        Value Type     Required? Description
3441      ================= ============== ========= =======================================
3442      "amdhsa.version"  sequence of    Required  - The first integer is the major
3443                        2 integers                 version. Currently 1.
3444                                                 - The second integer is the minor
3445                                                   version. Currently 1.
3446      "amdhsa.target"   string         Required  The target name of the code using the syntax:
3447
3448                                                 .. code::
3449
3450                                                   <target-triple> [ "-" <target-id> ]
3451
3452                                                 A canonical target ID must be
3453                                                 used. See :ref:`amdgpu-target-triples`
3454                                                 and :ref:`amdgpu-target-id`.
3455      ================= ============== ========= =======================================
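
A minimal Python sketch of how the two keys in the table above could be
populated. The function names are hypothetical, and the triple and target ID
passed in are illustrative values only; the sketch does not check that the
target ID is canonical.

```python
def amdhsa_target(target_triple, target_id=None):
    """Join the pieces as <target-triple> [ "-" <target-id> ]."""
    if target_id is None:
        return target_triple
    return target_triple + "-" + target_id


def v4_top_level_metadata(target_triple, target_id=None):
    """Return only the keys that code object V4 changes or adds."""
    return {
        "amdhsa.version": [1, 1],  # major 1, minor 1, as listed in the table
        "amdhsa.target": amdhsa_target(target_triple, target_id),
    }


# Illustrative inputs: a triple with an empty environment component, and a
# processor name used as the target ID.
metadata = v4_top_level_metadata("amdgcn-amd-amdhsa-", "gfx908")
print(metadata)
```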
3456
3457 ..
3458
3459 Kernel Dispatch
3460 ~~~~~~~~~~~~~~~
3461
3462 The HSA architected queuing language (AQL) defines a user space memory interface
3463 that can be used to control the dispatch of kernels, in an agent independent
3464 way. An agent can have zero or more AQL queues created for it using an HSA
3465 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3466 are 64 bytes) can be placed. See the *HSA Platform System Architecture
3467 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3468
3469 The packet processor of a kernel agent is responsible for detecting and
3470 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3471 packet processor is implemented by the hardware command processor (CP),
3472 asynchronous dispatch controller (ADC) and shader processor input controller
3473 (SPI).
3474
3475 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3476 the kernel mode driver to initialize and register the AQL queue with CP.
3477
3478 To dispatch a kernel the following actions are performed. This can occur in the
3479 CPU host program, or from an HSA kernel executing on a GPU.
3480
3481 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3482    executed is obtained.
3483 2. A pointer to the kernel descriptor (see
3484    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3485    It must be for a kernel that is contained in a code object that was
3486    loaded by an HSA compatible runtime on the kernel agent with which the AQL
3487    queue is associated.
3488 3. Space is allocated for the kernel arguments using the HSA compatible runtime
3489    allocator for a memory region with the kernarg property for the kernel agent
3490    that will execute the kernel. It must be at least 16-byte aligned.
3491 4. Kernel argument values are assigned to the kernel argument memory
3492    allocation. The layout is defined in the *HSA Programmer's Language
3493    Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3494    kernel argument memory in the same way constant memory is accessed. (Note
3495    that the HSA specification allows an implementation to copy the kernel
3496    argument contents to another location that is accessed by the kernel.)
3497 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3498    runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3499    for the packet. The packet must be set up, and the final write must use an
3500    atomic store release to set the packet kind to ensure the packet contents are
3501    visible to the kernel agent. AQL defines a doorbell signal mechanism to
3502    notify the kernel agent that the AQL queue has been updated. These rules, and
3503    the layout of the AQL queue and kernel dispatch packet, are defined in the
3504    *HSA Platform System Architecture Specification* [HSA]_.
3505 6. A kernel dispatch packet includes information about the actual dispatch,
3506    such as grid and work-group size, together with information from the code
3507    object about the kernel, such as segment sizes. The HSA compatible runtime
3508    queries on the kernel symbol can be used to obtain the code object values
3509    which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
3510 7. CP executes microcode and is responsible for detecting and setting up the
3511    GPU to execute the wavefronts of a kernel dispatch.
3512 8. CP ensures that when a wavefront starts executing the kernel machine
3513    code, the scalar general purpose registers (SGPR) and vector general purpose
3514    registers (VGPR) are set up as required by the machine code. The required
3515    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3516    register state is defined in
3517    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
3518 9. The prolog of the kernel machine code (see
3519    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3520    before continuing executing the machine code that corresponds to the kernel.
3521 10. When the kernel dispatch has completed execution, CP signals the completion
3522     signal specified in the kernel dispatch packet if not 0.
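
The host-side part of the steps above (reserving a slot, filling in the packet,
publishing the packet kind last, then ringing the doorbell) can be modelled in
a few lines of Python. This is a toy model and not a runtime interface: the
class and method names are hypothetical, the packet type values and field
layout are assumptions based on the HSA runtime headers and should be checked
against the HSA specification, and plain Python assignments stand in for the
atomic operations a real runtime uses.

```python
import struct

PACKET_SIZE = 64  # the text states that all AQL packets are 64 bytes

# Assumed encodings (not taken from this document).
PACKET_TYPE_INVALID = 1
PACKET_TYPE_KERNEL_DISPATCH = 2
# Assumed little-endian layout: header, setup, workgroup size x/y/z, reserved
# (16-bit each); grid size x/y/z, private and group segment size (32-bit
# each); kernel object, kernarg address, reserved, completion signal (64-bit).
DISPATCH_FORMAT = "<6H5I4Q"
assert struct.calcsize(DISPATCH_FORMAT) == PACKET_SIZE


class ToyAqlQueue:
    """Host-side model of an AQL queue ring buffer."""

    def __init__(self, num_packets):
        self.num_packets = num_packets
        self.ring = bytearray(PACKET_SIZE * num_packets)
        for slot in range(num_packets):
            struct.pack_into("<H", self.ring, slot * PACKET_SIZE,
                             PACKET_TYPE_INVALID)
        self.write_index = 0
        self.doorbell = None

    def dispatch(self, kernel_object, kernarg_address, grid, workgroup,
                 dimensions=3, private_segment_size=0, group_segment_size=0,
                 completion_signal=0):
        if kernarg_address % 16 != 0:
            raise ValueError("kernarg memory must be at least 16-byte aligned")
        # Reserve a slot. A real runtime uses a 64-bit atomic operation and
        # also checks that the queue is not full; both are omitted here.
        index = self.write_index
        self.write_index += 1
        base = (index % self.num_packets) * PACKET_SIZE
        # Fill in the packet while its header still marks it as invalid.
        # Fence scope and barrier bits in the header are left at 0.
        self.ring[base:base + PACKET_SIZE] = struct.pack(
            DISPATCH_FORMAT,
            PACKET_TYPE_INVALID, dimensions,
            workgroup[0], workgroup[1], workgroup[2], 0,
            grid[0], grid[1], grid[2],
            private_segment_size, group_segment_size,
            kernel_object, kernarg_address, 0, completion_signal)
        # Final write: publish the packet kind. A real runtime performs this
        # as an atomic store release.
        struct.pack_into("<H", self.ring, base, PACKET_TYPE_KERNEL_DISPATCH)
        # Notify the packet processor that the queue has been updated.
        self.doorbell = index
        return index


# Hypothetical addresses for the kernel descriptor and kernarg allocation.
queue = ToyAqlQueue(num_packets=4)
index = queue.dispatch(kernel_object=0x7F0000001000,
                       kernarg_address=0x7F0000002000,
                       grid=(256, 1, 1), workgroup=(64, 1, 1), dimensions=1)
```

Writing the header last is what lets the packet processor treat any slot whose
kind is still invalid as not yet ready.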
3523
3524 .. _amdgpu-amdhsa-memory-spaces:
3525
3526 Memory Spaces
3527 ~~~~~~~~~~~~~
3528
3529 The memory space properties are:
3530
3531   .. table:: AMDHSA Memory Spaces
3532      :name: amdgpu-amdhsa-memory-spaces-table
3533
3534      ================= =========== ======== ======= ==================
3535      Memory Space Name HSA Segment Hardware Address NULL Value
3536                        Name        Name     Size
3537      ================= =========== ======== ======= ==================
3538      Private           private     scratch  32      0x00000000
3539      Local             group       LDS      32      0xFFFFFFFF
3540      Global            global      global   64      0x0000000000000000
3541      Constant          constant    *same as 64      0x0000000000000000
3542                                    global*
3543      Generic           flat        flat     64      0x0000000000000000
3544      Region            N/A         GDS      32      *not implemented
3545                                                     for AMDHSA*
3546      ================= =========== ======== ======= ==================
3547
3548 The global and constant memory spaces both use global virtual addresses, which
3549 are the same virtual address space used by the CPU. However, some virtual
3550 addresses may only be accessible to the CPU, some only accessible by the GPU,
3551 and some by both.
3552
3553 Using the constant memory space indicates that the data will not change during
3554 the execution of the kernel. This allows scalar read instructions to be
3555 used. The vector and scalar L1 caches are invalidated of volatile data before
3556 each kernel dispatch execution to allow constant memory to change values between
3557 kernel dispatches.
3558
3559 The local memory space uses the hardware Local Data Store (LDS) which is
3560 automatically allocated when the hardware creates work-groups of wavefronts, and
3561 freed when all the wavefronts of a work-group have terminated. The data store
3562 (DS) instructions can be used to access it.
3563
3564 The private memory space uses the hardware scratch memory support. If the kernel
3565 uses scratch, then the hardware allocates memory that is accessed using
3566 wavefront lane dword (4 byte) interleaving. The mapping used from private
3567 address to physical address is:
3568
3569   ``wavefront-scratch-base +
3570   (private-address * wavefront-size * 4) +
3571   (wavefront-lane-id * 4)``
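
The following Python sketch evaluates the mapping exactly as written above. It
assumes that ``private-address`` counts dwords, since the expression scales it
by a full row of ``wavefront-size * 4`` bytes; that reading, the function name
and the example base address are assumptions of this sketch.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_size, wavefront_lane_id):
    """Evaluate the interleaved scratch mapping given in the text."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


base = 0x1000  # hypothetical wavefront scratch base
# Four lanes of a 64-lane wavefront accessing the same private address.
lanes = [scratch_physical_address(base, 2, 64, lane) for lane in range(4)]
print([hex(a) for a in lanes])
```

Lanes that use the same private address land on consecutive dwords, while
consecutive private addresses within one lane are ``wavefront-size * 4`` bytes
apart.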
3572
3573 There are different ways that the wavefront scratch base address is determined
3574 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3575 memory can be accessed in an interleaved manner using buffer instructions with
3576 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3577 instructions, or by flat instructions. If each lane of a wavefront accesses the
3578 same private address, the interleaving results in adjacent dwords being accessed
3579 and hence requires fewer cache lines to be fetched. Multi-dword access is not
3580 supported except by flat and scratch instructions in GFX9-GFX10.
3581
3582 The generic address space uses the hardware flat address support available in
3583 GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
3584 local apertures), that are outside the range of addressable global memory, to
3585 map from a flat address to a private or local address.
3586
3587 FLAT instructions can take a flat address and access global, private (scratch)
3588 and group (LDS) memory depending on whether the address is within one of the
3589 aperture ranges. Flat access to scratch requires hardware aperture setup and
3590 setup in the kernel prolog (see
3591 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3592 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3593 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3594
3595 To convert between a segment address and a flat address the base addresses of
3596 the apertures can be used. For GFX7-GFX8 these are available in the
3597 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
3598 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3599 GFX9-GFX10 the aperture base addresses are directly available as inline constant
3600 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3601 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3602 which makes it easier to convert from flat to segment or segment to flat.
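
Because each aperture is 2^32 bytes in size and its base is 2^32 aligned, the
conversion in 64-bit address mode reduces to operations on the upper and lower
32 bits of the address. The Python sketch below illustrates this; the function
names and the aperture base values are hypothetical, and real bases come from
the queue structure or the inline constant registers described above.

```python
APERTURE_SIZE = 1 << 32

# Hypothetical, 2^32-aligned aperture bases chosen only for this example.
LOCAL_APERTURE_BASE = 0x0001000000000000
PRIVATE_APERTURE_BASE = 0x0002000000000000


def in_aperture(flat_address, aperture_base):
    """True if the flat address falls inside the aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)


def segment_to_flat(segment_address, aperture_base):
    """Place a 32-bit segment address in the low half of the aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(flat_address, aperture_base):
    """Keep only the low 32 bits of a flat address inside the aperture."""
    if not in_aperture(flat_address, aperture_base):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)


flat = segment_to_flat(0x40, LOCAL_APERTURE_BASE)
print(hex(flat), hex(flat_to_segment(flat, LOCAL_APERTURE_BASE)))
```

The sketch ignores NULL pointers, whose values differ between memory spaces as
listed in :ref:`amdgpu-amdhsa-memory-spaces-table`, so a complete conversion
would have to treat them separately.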
3603
3604 Image and Samplers
3605 ~~~~~~~~~~~~~~~~~~
3606
3607 Image and sampler handles created by an HSA compatible runtime (see
3608 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3609 object respectively. In order to support the HSA ``query_sampler`` operations
3610 two extra dwords are used to store the HSA BRIG enumeration values for the
3611 queries that are not trivially deducible from the S# representation.
3612
3613 HSA Signals
3614 ~~~~~~~~~~~
3615
3616 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3617 are 64-bit addresses of a structure allocated in memory accessible from both the
3618 CPU and GPU. The structure is defined by the runtime and subject to change
3619 between releases. For example, see [AMD-ROCm-github]_.
3620
3621 .. _amdgpu-amdhsa-hsa-aql-queue:
3622
3623 HSA AQL Queue
3624 ~~~~~~~~~~~~~
3625
3626 The HSA AQL queue structure is defined by an HSA compatible runtime (see
3627 :ref:`amdgpu-os`) and subject to change between releases. For example, see
3628 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
3629 certain language features such as the flat address aperture bases. It also
3630 contains fields used by CP, for example to manage scratch memory allocation.
3631
3632 .. _amdgpu-amdhsa-kernel-descriptor:
3633
3634 Kernel Descriptor
3635 ~~~~~~~~~~~~~~~~~
3636
3637 A kernel descriptor consists of the information needed by CP to initiate the
3638 execution of a kernel, including the entry point address of the machine code
3639 that implements the kernel.
3640
3641 Code Object V3 Kernel Descriptor
3642 ++++++++++++++++++++++++++++++++
3643
3644 CP microcode requires the kernel descriptor to be allocated on 64-byte
3645 alignment.
3646
3647 The fields used by CP for code objects before V3 also match those specified in
3648 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
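
As a sketch of how the leading fields listed in
:ref:`amdgpu-amdhsa-kernel-descriptor-v3-table` could be read, the Python
fragment below unpacks the first 24 bytes of a descriptor and derives the
entry point address from ``KERNEL_CODE_ENTRY_BYTE_OFFSET``. The function name,
the little-endian encoding and the example addresses are assumptions of this
sketch.

```python
import struct

# Assumed little-endian encoding of bits 191:0: three 32-bit sizes, a 32-bit
# reserved field, then the signed 64-bit KERNEL_CODE_ENTRY_BYTE_OFFSET.
LEADING_FIELDS = "<4Iq"


def read_leading_fields(descriptor_address, descriptor_bytes):
    """Decode the first 24 bytes of a kernel descriptor."""
    if descriptor_address % 64 != 0:
        raise ValueError("kernel descriptor must be 64-byte aligned")
    group, private, kernarg, reserved, entry_offset = struct.unpack_from(
        LEADING_FIELDS, descriptor_bytes, 0)
    if reserved != 0:
        raise ValueError("reserved bits 127:96 must be 0")
    # The offset is relative to the descriptor base and may be negative.
    entry_address = descriptor_address + entry_offset
    if entry_address % 256 != 0:
        raise ValueError("kernel entry point must be 256-byte aligned")
    return {
        "group_segment_fixed_size": group,
        "private_segment_fixed_size": private,
        "kernarg_size": kernarg,
        "entry_address": entry_address,
    }


# Synthetic descriptor whose code starts 0x40 bytes before the descriptor.
descriptor = struct.pack(LEADING_FIELDS, 0, 16, 24, 0, -0x40) + bytes(40)
info = read_leading_fields(0x10040, descriptor)
print(info)
```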
3649
3650   .. table:: Code Object V3 Kernel Descriptor
3651      :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3652
3653      ======= ======= =============================== ============================
3654      Bits    Size    Field Name                      Description
3655      ======= ======= =============================== ============================
3656      31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
3657                                                      address space memory
3658                                                      required for a work-group
3659                                                      in bytes. This does not
3660                                                      include any dynamically
3661                                                      allocated local address
3662                                                      space memory that may be
3663                                                      added when the kernel is
3664                                                      dispatched.
3665      63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
3666                                                      private address space
3667                                                      memory required for a
3668                                                      work-item in bytes.
3669                                                      Additional space may need to
3670                                                      be added to this value if
3671                                                      the call stack has
3672                                                      non-inlined function calls.
3673      95:64   4 bytes KERNARG_SIZE                    The size of the kernarg
3674                                                      memory pointed to by the
3675                                                      AQL dispatch packet. The
3676                                                      kernarg memory is used to
3677                                                      pass arguments to the
3678                                                      kernel.
3679
3680                                                      * If the kernarg pointer in
3681                                                        the dispatch packet is NULL
3682                                                        then there are no kernel
3683                                                        arguments.
3684                                                      * If the kernarg pointer in
3685                                                        the dispatch packet is
3686                                                        not NULL and this value
3687                                                        is 0 then the kernarg
3688                                                        memory size is
3689                                                        unspecified.
3690                                                      * If the kernarg pointer in
3691                                                        the dispatch packet is
3692                                                        not NULL and this value
3693                                                        is not 0 then the value
3694                                                        specifies the kernarg
3695                                                        memory size in bytes. It
3696                                                        is recommended to provide
3697                                                        a value as it may be used
3698                                                        by CP to optimize making
3699                                                        the kernarg memory
3700                                                        visible to the kernel
3701                                                        code.
3702
3703      127:96  4 bytes                                 Reserved, must be 0.
3704      191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
3705                                                      negative) from base
3706                                                      address of kernel
3707                                                      descriptor to kernel's
3708                                                      entry point instruction
3709                                                      which must be 256 byte
3710                                                      aligned.
3711      351:192 20                                      Reserved, must be 0.
3712              bytes
3713      383:352 4 bytes COMPUTE_PGM_RSRC3               GFX6-GFX9
3714                                                        Reserved, must be 0.
3715                                                      GFX90A
3716                                                        Compute Shader (CS)
3717                                                        program settings used by
3718                                                        CP to set up
3719                                                        ``COMPUTE_PGM_RSRC3``
3720                                                        configuration
3721                                                        register. See
3722                                                        :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3723                                                      GFX10
3724                                                        Compute Shader (CS)
3725                                                        program settings used by
3726                                                        CP to set up
3727                                                        ``COMPUTE_PGM_RSRC3``
3728                                                        configuration
3729                                                        register. See
3730                                                        :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
3731      415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
3732                                                      program settings used by
3733                                                      CP to set up
3734                                                      ``COMPUTE_PGM_RSRC1``
3735                                                      configuration
3736                                                      register. See
3737                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3738      447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
3739                                                      program settings used by
3740                                                      CP to set up
3741                                                      ``COMPUTE_PGM_RSRC2``
3742                                                      configuration
3743                                                      register. See
3744                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
3745      454:448 7 bits  *See separate bits below.*      Enable the setup of the
3746                                                      SGPR user data registers
3747                                                      (see
3748                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3749
3750                                                      The total number of SGPR
3751                                                      user data registers
3752                                                      requested must not exceed
3753                                                      16 and must match the value in
3754                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3755                                                      Any requests beyond 16
3756                                                      will be ignored.
3757      >448    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     If the *Target Properties*
3758                      _BUFFER                         column of
3759                                                      :ref:`amdgpu-processor-table`
3760                                                      specifies *Architected flat
3761                                                      scratch* then this bit is not
3762                                                      supported and must be 0.
3763      >449    1 bit   ENABLE_SGPR_DISPATCH_PTR
3764      >450    1 bit   ENABLE_SGPR_QUEUE_PTR
3765      >451    1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR
3766      >452    1 bit   ENABLE_SGPR_DISPATCH_ID
3767      >453    1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   If the *Target Properties*
3768                                                      column of
3769                                                      :ref:`amdgpu-processor-table`
3770                                                      specifies *Architected flat
3771                                                      scratch* then this bit is not
3772                                                      supported and must be 0.
3773      >454    1 bit   ENABLE_SGPR_PRIVATE_SEGMENT
3774                      _SIZE
3775      457:455 3 bits                                  Reserved, must be 0.
3776      458     1 bit   ENABLE_WAVEFRONT_SIZE32         GFX6-GFX9
3777                                                        Reserved, must be 0.
3778                                                      GFX10
3779                                                        - If 0 execute in
3780                                                          wavefront size 64 mode.
3781                                                        - If 1 execute in
3782                                                          native wavefront size
3783                                                          32 mode.
3784      463:459 5 bits                                  Reserved, must be 0.
3785      464     1 bit   RESERVED_464                    Deprecated, must be 0.
3786      467:465 3 bits                                  Reserved, must be 0.
3787      468     1 bit   RESERVED_468                    Deprecated, must be 0.
3788      471:469 3 bits                                  Reserved, must be 0.
3789      511:472 5 bytes                                 Reserved, must be 0.
3790      512     **Total size 64 bytes.**
3791      ======= ====================================================================
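
The tail of the kernel descriptor table above can be exercised with a small decoder. This is a minimal sketch, not part of the LLVM tooling. The function name ``decode_kernel_descriptor_tail`` and the shape of the returned dictionary are hypothetical. It assumes the descriptor is stored little-endian, so that bits 415:384, 447:416 and 463:448 start at byte offsets 48, 52 and 56.

```python
import struct

# Flag positions relative to bit 448, copied from the table above.
KERNEL_CODE_PROPERTY_BITS = {
    "ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER": 0,
    "ENABLE_SGPR_DISPATCH_PTR": 1,
    "ENABLE_SGPR_QUEUE_PTR": 2,
    "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
    "ENABLE_SGPR_DISPATCH_ID": 4,
    "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
    "ENABLE_SGPR_PRIVATE_SEGMENT_SIZE": 6,
    "ENABLE_WAVEFRONT_SIZE32": 10,
}


def decode_kernel_descriptor_tail(descriptor: bytes) -> dict:
    """Decode COMPUTE_PGM_RSRC1/2 and the kernel code property flags.

    Hypothetical helper for illustration only.
    """
    if len(descriptor) != 64:
        raise ValueError("a kernel descriptor is exactly 64 bytes")
    # Assumption: little-endian storage.
    # Bits 415:384 -> byte 48, bits 447:416 -> byte 52, bits 463:448 -> byte 56.
    rsrc1, rsrc2, properties = struct.unpack_from("<IIH", descriptor, 48)
    flags = {
        name: bool((properties >> bit) & 1)
        for name, bit in KERNEL_CODE_PROPERTY_BITS.items()
    }
    return {
        "compute_pgm_rsrc1": rsrc1,
        "compute_pgm_rsrc2": rsrc2,
        "flags": flags,
    }
```

Passing the 64 raw descriptor bytes returns the two resource words unmodified, plus one boolean per flag. The resource words can then be decoded field by field using their own tables.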
3792
3793 ..
3794
3795   .. table:: compute_pgm_rsrc1 for GFX6-GFX10
3796      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
3797
3798      ======= ======= =============================== ===========================================================================
3799      Bits    Size    Field Name                      Description
3800      ======= ======= =============================== ===========================================================================
3801      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
3802                                                      blocks used by each work-item;
3803                                                      granularity is device
3804                                                      specific:
3805
3806                                                      GFX6-GFX9
3807                                                        - vgprs_used 0..256
3808                                                        - max(0, ceil(vgprs_used / 4) - 1)
3809                                                      GFX90A
3810                                                        - vgprs_used 0..512
3811                                                        - vgprs_used = align(arch_vgprs, 4)
3812                                                                       + acc_vgprs
3813                                                        - max(0, ceil(vgprs_used / 8) - 1)
3814                                                      GFX10 (wavefront size 64)
3815                                                        - vgprs_used 1..256
3816                                                        - max(0, ceil(vgprs_used / 4) - 1)
3817                                                      GFX10 (wavefront size 32)
3818                                                        - vgprs_used 1..256
3819                                                        - max(0, ceil(vgprs_used / 8) - 1)
3820
3821                                                      Where vgprs_used is defined
3822                                                      as the highest VGPR number
3823                                                      explicitly referenced plus
3824                                                      one.
3825
3826                                                      Used by CP to set up
3827                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
3828
3829                                                      The
3830                                                      :ref:`amdgpu-assembler`
3831                                                      calculates this
3832                                                      automatically for the
3833                                                      selected processor from
3834                                                      values provided to the
3835                                                      `.amdhsa_kernel` directive
3836                                                      by the
3837                                                      `.amdhsa_next_free_vgpr`
3838                                                      nested directive (see
3839                                                      :ref:`amdhsa-kernel-directives-table`).
3840      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
3841                                                      blocks used by a wavefront;
3842                                                      granularity is device
3843                                                      specific:
3844
3845                                                      GFX6-GFX8
3846                                                        - sgprs_used 0..112
3847                                                        - max(0, ceil(sgprs_used / 8) - 1)
3848                                                      GFX9
3849                                                        - sgprs_used 0..112
3850                                                        - 2 * max(0, ceil(sgprs_used / 16) - 1)
3851                                                      GFX10
3852                                                        Reserved, must be 0.
3853                                                        (128 SGPRs always
3854                                                        allocated.)
3855
3856                                                      Where sgprs_used is
3857                                                      defined as the highest
3858                                                      SGPR number explicitly
3859                                                      referenced plus one, plus
3860                                                      a target specific number
3861                                                      of additional special
3862                                                      SGPRs for VCC,
3863                                                      FLAT_SCRATCH (GFX7+) and
3864                                                      XNACK_MASK (GFX8+), and
3865                                                      any additional
3866                                                      target specific
3867                                                      limitations. It does not
3868                                                      include the 16 SGPRs added
3869                                                      if a trap handler is
3870                                                      enabled.
3871
3872                                                      The target specific
3873                                                      limitations and special
3874                                                      SGPR layout are defined in
3875                                                      the hardware
3876                                                      documentation, which can
3877                                                      be found in the
3878                                                      :ref:`amdgpu-processors`
3879                                                      table.
3880
3881                                                      Used by CP to set up
3882                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
3883
3884                                                      The
3885                                                      :ref:`amdgpu-assembler`
3886                                                      calculates this
3887                                                      automatically for the
3888                                                      selected processor from
3889                                                      values provided to the
3890                                                      `.amdhsa_kernel` directive
3891                                                      by the
3892                                                      `.amdhsa_next_free_sgpr`
3893                                                      and `.amdhsa_reserve_*`
3894                                                      nested directives (see
3895                                                      :ref:`amdhsa-kernel-directives-table`).
3896      11:10   2 bits  PRIORITY                        Must be 0.
3897
3898                                                      Start executing wavefront
3899                                                      at the specified priority.
3900
3901                                                      CP is responsible for
3902                                                      filling in
3903                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
3904      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
3905                                                      with specified rounding
3906                                                      mode for single
3907                                                      precision (32-bit)
3908                                                      floating point
3909                                                      operations.
3910
3911                                                      Floating point rounding
3912                                                      mode values are defined in
3913                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3914
3915                                                      Used by CP to set up
3916                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3917      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
3918                                                      with specified rounding
3919                                                      mode for half/double
3920                                                      precision (16- and 64-bit)
3921                                                      floating point
3922                                                      operations.
3923
3924                                                      Floating point rounding
3925                                                      mode values are defined in
3926                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3927
3928                                                      Used by CP to set up
3929                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3930      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
3931                                                      with specified denorm mode
3932                                                      for single
3933                                                      precision (32-bit)
3934                                                      floating point
3935                                                      operations.
3936
3937                                                      Floating point denorm mode
3938                                                      values are defined in
3939                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3940
3941                                                      Used by CP to set up
3942                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3943      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
3944                                                      with specified denorm mode
3945                                                      for half/double
3946                                                      precision (16- and 64-bit)
3947                                                      floating point
3948                                                      operations.
3949
3950                                                      Floating point denorm mode
3951                                                      values are defined in
3952                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3953
3954                                                      Used by CP to set up
3955                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3956      20      1 bit   PRIV                            Must be 0.
3957
3958                                                      Start executing wavefront
3959                                                      in privileged trap handler
3960                                                      mode.
3961
3962                                                      CP is responsible for
3963                                                      filling in
3964                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
3965      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
3966                                                      with DX10 clamp mode
3967                                                      enabled. Used by the vector
3968                                                      ALU to force DX10 style
3969                                                      treatment of NaNs (when
3970                                                      set, clamp NaN to zero,
3971                                                      otherwise pass NaN
3972                                                      through).
3973
3974                                                      Used by CP to set up
3975                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
3976      22      1 bit   DEBUG_MODE                      Must be 0.
3977
3978                                                      Start executing wavefront
3979                                                      in single step mode.
3980
3981                                                      CP is responsible for
3982                                                      filling in
3983                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
3984      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
3985                                                      with IEEE mode
3986                                                      enabled. Floating point
3987                                                      opcodes that support
3988                                                      exception flag gathering
3989                                                      will quiet and propagate
3990                                                      signaling-NaN inputs per
3991                                                      IEEE 754-2008. Min_dx10 and
3992                                                      max_dx10 become IEEE
3993                                                      754-2008 compliant due to
3994                                                      signaling-NaN propagation
3995                                                      and quieting.
3996
3997                                                      Used by CP to set up
3998                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
3999      24      1 bit   BULKY                           Must be 0.
4000
4001                                                      Only one work-group allowed
4002                                                      to execute on a compute
4003                                                      unit.
4004
4005                                                      CP is responsible for
4006                                                      filling in
4007                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
4008      25      1 bit   CDBG_USER                       Must be 0.
4009
4010                                                      Flag that can be used to
4011                                                      control debugging code.
4012
4013                                                      CP is responsible for
4014                                                      filling in
4015                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4016      26      1 bit   FP16_OVFL                       GFX6-GFX8
4017                                                        Reserved, must be 0.
4018                                                      GFX9-GFX10
4019                                                        Wavefront starts execution
4020                                                        with specified fp16 overflow
4021                                                        mode.
4022
4023                                                        - If 0, fp16 overflow generates
4024                                                          +/-INF values.
4025                                                        - If 1, fp16 overflow that is the
4026                                                          result of a +/-INF input value
4027                                                          or divide by 0 produces a +/-INF,
4028                                                          otherwise clamps computed
4029                                                          overflow to +/-MAX_FP16 as
4030                                                          appropriate.
4031
4032                                                        Used by CP to set up
4033                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4034      28:27   2 bits                                  Reserved, must be 0.
4035      29      1 bit   WGP_MODE                        GFX6-GFX9
4036                                                        Reserved, must be 0.
4037                                                      GFX10
4038                                                        - If 0 execute work-groups in
4039                                                          CU wavefront execution mode.
4040                                                        - If 1 execute work-groups in
4041                                                          WGP wavefront execution mode.
4042
4043                                                        See :ref:`amdgpu-amdhsa-memory-model`.
4044
4045                                                        Used by CP to set up
4046                                                        ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4047      30      1 bit   MEM_ORDERED                     GFX6-GFX9
4048                                                        Reserved, must be 0.
4049                                                      GFX10
4050                                                        Controls the behavior of the
4051                                                        s_waitcnt's vmcnt and vscnt
4052                                                        counters.
4053
4054                                                        - If 0 vmcnt reports completion
4055                                                          of load and atomic with return
4056                                                          out of order with sample
4057                                                          instructions, and the vscnt
4058                                                          reports the completion of
4059                                                          store and atomic without
4060                                                          return in order.
4061                                                        - If 1 vmcnt reports completion
4062                                                          of load, atomic with return
4063                                                          and sample instructions in
4064                                                          order, and the vscnt reports
4065                                                          the completion of store and
4066                                                          atomic without return in order.
4067
4068                                                        Used by CP to set up
4069                                                        ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4070      31      1 bit   FWD_PROGRESS                    GFX6-GFX9
4071                                                        Reserved, must be 0.
4072                                                      GFX10
4073                                                        - If 0 execute SIMD wavefronts
4074                                                          using oldest first policy.
4075                                                        - If 1 execute SIMD wavefronts to
4076                                                          ensure wavefronts will make some
4077                                                          forward progress.
4078
4079                                                        Used by CP to set up
4080                                                        ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4081      32      **Total size 4 bytes**
4082      ======= ===================================================================================================================
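
The granulated register count formulas in the ``compute_pgm_rsrc1`` table above can be checked with a few lines of arithmetic. The sketch below is illustrative only. The function names and the ``generation`` strings (which mirror the row headings in the table) are hypothetical and are not an LLVM API. The assembler computes these values itself from ``.amdhsa_next_free_vgpr`` and ``.amdhsa_next_free_sgpr``.

```python
def _ceil_div(value: int, granule: int) -> int:
    """Integer ceil(value / granule) for a non-negative value."""
    return -(-value // granule)


def gfx90a_vgprs_used(arch_vgprs: int, acc_vgprs: int) -> int:
    """GFX90A only: vgprs_used = align(arch_vgprs, 4) + acc_vgprs."""
    return _ceil_div(arch_vgprs, 4) * 4 + acc_vgprs


def granulated_vgpr_count(vgprs_used: int, generation: str,
                          wave32: bool = False) -> int:
    """GRANULATED_WORKITEM_VGPR_COUNT per the table's formulas."""
    if generation == "gfx6-gfx9":
        granule = 4
    elif generation == "gfx90a":
        granule = 8
    elif generation == "gfx10":
        # Wavefront size 32 allocates in blocks of 8, size 64 in blocks of 4.
        granule = 8 if wave32 else 4
    else:
        raise ValueError(f"unknown generation: {generation}")
    return max(0, _ceil_div(vgprs_used, granule) - 1)


def granulated_sgpr_count(sgprs_used: int, generation: str) -> int:
    """GRANULATED_WAVEFRONT_SGPR_COUNT per the table's formulas."""
    if generation == "gfx6-gfx8":
        return max(0, _ceil_div(sgprs_used, 8) - 1)
    if generation == "gfx9":
        return 2 * max(0, _ceil_div(sgprs_used, 16) - 1)
    if generation == "gfx10":
        # Reserved, must be 0: 128 SGPRs are always allocated.
        return 0
    raise ValueError(f"unknown generation: {generation}")
```

Here ``sgprs_used`` is expected to already include the additional special SGPRs (VCC, FLAT_SCRATCH, XNACK_MASK) described in the table. The sketch does not add them.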
4083
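The bit ranges in the ``compute_pgm_rsrc1`` table above map directly onto shift and mask operations. The following sketch packs and unpacks the 32-bit word and reports any field that the table says must be 0. The names ``RSRC1_FIELDS``, ``encode_rsrc1``, ``decode_rsrc1`` and ``must_be_zero_violations`` are hypothetical helpers for illustration, not LLVM APIs, and the generation-specific reserved fields (for example ``WGP_MODE`` on GFX6-GFX9) are not checked.

```python
# (low bit, width) for each named field, copied from the table above.
RSRC1_FIELDS = {
    "GRANULATED_WORKITEM_VGPR_COUNT": (0, 6),
    "GRANULATED_WAVEFRONT_SGPR_COUNT": (6, 4),
    "PRIORITY": (10, 2),
    "FLOAT_ROUND_MODE_32": (12, 2),
    "FLOAT_ROUND_MODE_16_64": (14, 2),
    "FLOAT_DENORM_MODE_32": (16, 2),
    "FLOAT_DENORM_MODE_16_64": (18, 2),
    "PRIV": (20, 1),
    "ENABLE_DX10_CLAMP": (21, 1),
    "DEBUG_MODE": (22, 1),
    "ENABLE_IEEE_MODE": (23, 1),
    "BULKY": (24, 1),
    "CDBG_USER": (25, 1),
    "FP16_OVFL": (26, 1),
    "RESERVED_28_27": (27, 2),  # unnamed in the table; label is made up
    "WGP_MODE": (29, 1),
    "MEM_ORDERED": (30, 1),
    "FWD_PROGRESS": (31, 1),
}

# Fields the table marks "Must be 0" or "Reserved, must be 0" for all targets.
MUST_BE_ZERO = ("PRIORITY", "PRIV", "DEBUG_MODE", "BULKY", "CDBG_USER",
                "RESERVED_28_27")


def encode_rsrc1(**fields: int) -> int:
    """Pack named field values into a 32-bit compute_pgm_rsrc1 word."""
    word = 0
    for name, value in fields.items():
        low, width = RSRC1_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        word |= value << low
    return word


def decode_rsrc1(word: int) -> dict:
    """Split a 32-bit compute_pgm_rsrc1 word into its named fields."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, (low, width) in RSRC1_FIELDS.items()
    }


def must_be_zero_violations(word: int) -> list:
    """Names of always-zero fields that are set in the given word."""
    decoded = decode_rsrc1(word)
    return [name for name in MUST_BE_ZERO if decoded[name] != 0]
```

A table-driven layout keeps the bit positions in one place, so a mismatch with the table is easy to spot.
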
4084 ..
4085
4086   .. table:: compute_pgm_rsrc2 for GFX6-GFX10
4087      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
4088
4089      ======= ======= =============================== ===========================================================================
4090      Bits    Size    Field Name                      Description
4091      ======= ======= =============================== ===========================================================================
4092      0       1 bit   ENABLE_PRIVATE_SEGMENT          * Enable the setup of the
4093                                                        private segment.
4094                                                      * If the *Target Properties*
4095                                                        column of
4096                                                        :ref:`amdgpu-processor-table`
4097                                                        does not specify
4098                                                        *Architected flat
4099                                                        scratch* then enable the
4100                                                        setup of the SGPR
4101                                                        wavefront scratch offset
4102                                                        system register (see
4103                                                        :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4104                                                      * If the *Target Properties*
4105                                                        column of
4106                                                        :ref:`amdgpu-processor-table`
4107                                                        specifies *Architected
4108                                                        flat scratch* then enable
4109                                                        the setup of the
4110                                                        FLAT_SCRATCH register
4111                                                        pair (see
4112                                                        :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4113
4114                                                      Used by CP to set up
4115                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4116      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
4117                                                      user data registers
4118                                                      requested. This number must
4119                                                      match the number of user
4120                                                      data registers enabled.
4121
4122                                                      Used by CP to set up
4123                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4124      6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
4125
4126                                                      This bit represents
4127                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4128                                                      which is set by the CP if
4129                                                      the runtime has installed a
4130                                                      trap handler.
4131      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
4132                                                      system SGPR register for
4133                                                      the work-group id in the X
4134                                                      dimension (see
4135                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4136
4137                                                      Used by CP to set up
4138                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4139      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
4140                                                      system SGPR register for
4141                                                      the work-group id in the Y
4142                                                      dimension (see
4143                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4144
4145                                                      Used by CP to set up
4146                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4147      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
4148                                                      system SGPR register for
4149                                                      the work-group id in the Z
4150                                                      dimension (see
4151                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4152
4153                                                      Used by CP to set up
4154                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4155      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
4156                                                      system SGPR register for
4157                                                      work-group information (see
4158                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4159
4160                                                      Used by CP to set up
4161                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4162      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
4163                                                      VGPR system registers used
4164                                                      for the work-item ID.
4165                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4166                                                      defines the values.
4167
4168                                                      Used by CP to set up
4169                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4170      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
4171
4172                                                      Wavefront starts execution
4173                                                      with address watch
4174                                                      exceptions enabled which
4175                                                      are generated when L1 has
4176                                                      witnessed a thread access
4177                                                      an *address of
4178                                                      interest*.
4179
4180                                                      CP is responsible for
4181                                                      filling in the address
4182                                                      watch bit in
4183                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4184                                                      according to what the
4185                                                      runtime requests.
4186      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
4187
4188                                                      Wavefront starts execution
4189                                                      with memory violation
4190                                                      exceptions
4191                                                      enabled which are generated
4192                                                      when a memory violation has
4193                                                      occurred for this wavefront from
4194                                                      L1 or LDS
4195                                                      (write-to-read-only-memory,
4196                                                      mis-aligned atomic, LDS
4197                                                      address out of range,
4198                                                      illegal address, etc.).
4199
4200                                                      CP sets the memory
4201                                                      violation bit in
4202                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4203                                                      according to what the
4204                                                      runtime requests.
4205      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
4206
4207                                                      CP uses the rounded value
4208                                                      from the dispatch packet,
4209                                                      not this value, as the
4210                                                      dispatch may contain
4211                                                      dynamically allocated group
4212                                                      segment memory. CP writes
4213                                                      directly to
4214                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4215
4216                                                      Amount of group segment
4217                                                      (LDS) to allocate for each
4218                                                      work-group. Granularity is
4219                                                      device specific:
4220
4221                                                      GFX6
4222                                                        roundup(lds-size / (64 * 4))
4223                                                      GFX7-GFX10
4224                                                        roundup(lds-size / (128 * 4))
4225
4226      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
4227                      _INVALID_OPERATION              with specified exceptions
4228                                                      enabled.
4229
4230                                                      Used by CP to set up
4231                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
4232                                                      (set from bits 0..6).
4233
4234                                                      IEEE 754 FP Invalid
4235                                                      Operation
4236      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal (one or more
4237                      _SOURCE                         input operands is a
4238                                                      denormal number)
4239      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
4240                      _DIVISION_BY_ZERO               Zero
4241      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4242                      _OVERFLOW
4243      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
4244                      _UNDERFLOW
4245      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
4246                      _INEXACT
4247      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
4248                      _ZERO                           (rcp_iflag_f32 instruction
4249                                                      only)
4250      31      1 bit                                   Reserved, must be 0.
4251      32      **Total size 4 bytes.**
4252      ======= ===================================================================================================================
4253
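As a rough illustration of the fields listed above, the following Python sketch
extracts ``ENABLE_VGPR_WORKITEM_ID`` and the exception enable bits from a
``compute_pgm_rsrc2`` value, and evaluates the granularity formula given for
``GRANULATED_LDS_SIZE``. The function names are hypothetical and are not part of
any LLVM or runtime API. The sketch assumes ``lds-size`` is a byte count, which
is how the ``(64 * 4)`` and ``(128 * 4)`` divisors read (64 or 128 dwords of 4
bytes each). The field itself must still be 0 in the kernel descriptor, so the
last helper only shows the rounding.

```python
# Exception enable fields in compute_pgm_rsrc2 bits 24..30, lowest bit first.
EXCEPTION_FIELDS = [
    "IEEE_754_FP_INVALID_OPERATION",  # bit 24
    "FP_DENORMAL_SOURCE",             # bit 25
    "IEEE_754_FP_DIVISION_BY_ZERO",   # bit 26
    "IEEE_754_FP_OVERFLOW",           # bit 27
    "IEEE_754_FP_UNDERFLOW",          # bit 28
    "IEEE_754_FP_INEXACT",            # bit 29
    "INT_DIVIDE_BY_ZERO",             # bit 30
]


def enable_vgpr_workitem_id(rsrc2):
    """Return the 2-bit ENABLE_VGPR_WORKITEM_ID field (bits 12:11)."""
    return (rsrc2 >> 11) & 0x3


def enabled_exceptions(rsrc2):
    """Return the names of the exception enables set in bits 30:24."""
    bits = (rsrc2 >> 24) & 0x7F
    return [name for i, name in enumerate(EXCEPTION_FIELDS) if bits & (1 << i)]


def granulated_lds_size(lds_size_bytes, gfx_major):
    """Evaluate roundup(lds-size / granule) for GFX6 or GFX7-GFX10.

    Assumption: lds_size_bytes is a byte count and the granule is
    64 dwords on GFX6 and 128 dwords on GFX7-GFX10.
    """
    if gfx_major == 6:
        granule_bytes = 64 * 4
    elif 7 <= gfx_major <= 10:
        granule_bytes = 128 * 4
    else:
        raise ValueError("granularity is only listed for GFX6-GFX10")
    # Integer round-up division.
    return (lds_size_bytes + granule_bytes - 1) // granule_bytes
```

For example, under these assumptions a 1000 byte allocation on GFX6 rounds up
to 4 granules of 256 bytes.
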
4254 ..
4255
4256   .. table:: compute_pgm_rsrc3 for GFX90A
4257      :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4258
4259      ======= ======= =============================== ===========================================================================
4260      Bits    Size    Field Name                      Description
4261      ======= ======= =============================== ===========================================================================
4262      5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
4263                                                      Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4264                                                      63 - accum-offset = 256.
4265      15:6    10                                      Reserved, must be 0.
4266              bits
4267      16      1 bit   TG_SPLIT                        - If 0 the waves of a work-group are
4268                                                        launched in the same CU.
4269                                                      - If 1 the waves of a work-group can be
4270                                                        launched in different CUs. The waves
4271                                                        cannot use S_BARRIER or LDS.
4272      31:17   15                                      Reserved, must be 0.
4273              bits
4274      32      **Total size 4 bytes.**
4275      ======= ===================================================================================================================
4276
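The ``ACCUM_OFFSET`` encoding described in the table above can be sketched in
Python as follows. The helper names are hypothetical. The only facts used are
the ones stated in the table: a granularity of 4 and a field value of 0-63 that
maps to an accum-offset of 4-256.

```python
def encode_accum_offset(accum_offset):
    """Convert an accum-offset (4, 8, ..., 256) to the 6-bit field value."""
    if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
        raise ValueError("accum-offset must be a multiple of 4 in [4, 256]")
    return accum_offset // 4 - 1


def decode_accum_offset(field):
    """Convert the 6-bit ACCUM_OFFSET field value back to an accum-offset."""
    if not 0 <= field <= 63:
        raise ValueError("ACCUM_OFFSET is a 6-bit field")
    return (field + 1) * 4
```

With these helpers, ``encode_accum_offset(8)`` gives ``1`` and
``decode_accum_offset(63)`` gives ``256``, matching the examples in the table.
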
4277 ..
4278
4279   .. table:: compute_pgm_rsrc3 for GFX10
4280      :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
4281
4282      ======= ======= =============================== ===========================================================================
4283      Bits    Size    Field Name                      Description
4284      ======= ======= =============================== ===========================================================================
4285      3:0     4 bits  SHARED_VGPR_COUNT               Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
4286                                                      compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
4287      31:4    28                                      Reserved, must be 0.
4288              bits
4289      32      **Total size 4 bytes.**
4290      ======= ===================================================================================================================
4291
4292 ..
4293
4294   .. table:: Floating Point Rounding Mode Enumeration Values
4295      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4296
4297      ====================================== ===== ==============================
4298      Enumeration Name                       Value Description
4299      ====================================== ===== ==============================
4300      FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
4301      FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
4302      FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
4303      FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
4304      ====================================== ===== ==============================
4305
4306 ..
4307
4308   .. table:: Floating Point Denorm Mode Enumeration Values
4309      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4310
4311      ====================================== ===== ==============================
4312      Enumeration Name                       Value Description
4313      ====================================== ===== ==============================
4314      FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
4315                                                   Denorms
4316      FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
4317      FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
4318      FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
4319      ====================================== ===== ==============================
4320
4321 ..
4322
4323   .. table:: System VGPR Work-Item ID Enumeration Values
4324      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4325
4326      ======================================== ===== ============================
4327      Enumeration Name                         Value Description
4328      ======================================== ===== ============================
4329      SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
4330                                                     ID.
4331      SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
4332                                                     dimensions ID.
4333      SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
4334                                                     dimensions ID.
4335      SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
4336      ======================================== ===== ============================
4337
4338 .. _amdgpu-amdhsa-initial-kernel-execution-state:
4339
4340 Initial Kernel Execution State
4341 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4342
4343 This section defines the register state that will be set up by the packet
4344 processor prior to the start of execution of every wavefront. This is limited by
4345 the constraints of the hardware controllers of CP/ADC/SPI.
4346
4347 The order of the SGPR registers is defined, but the compiler can specify which
4348 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4349 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4350 for enabled registers are dense starting at SGPR0: the first enabled register is
4351 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4352 an SGPR number.
4353
4354 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4355 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4356 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4357 actually initialized. These are then immediately followed by the System SGPRs
4358 that are set up by ADC/SPI and can have different values for each wavefront of
4359 the grid dispatch.
4360
4361 SGPR register initial state is defined in
4362 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4363
4364   .. table:: SGPR Register Set Up Order
4365      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4366
4367      ========== ========================== ====== ==============================
4368      SGPR Order Name                       Number Description
4369                 (kernel descriptor enable  of
4370                 field)                     SGPRs
4371      ========== ========================== ====== ==============================
4372      First      Private Segment Buffer     4      See
4373                 (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4374                 _segment_buffer)
4375      then       Dispatch Ptr               2      64-bit address of AQL dispatch
4376                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
4377                                                   actually executing.
4378      then       Queue Ptr                  2      64-bit address of amd_queue_t
4379                 (enable_sgpr_queue_ptr)           object for AQL queue on which
4380                                                   the dispatch packet was
4381                                                   queued.
4382      then       Kernarg Segment Ptr        2      64-bit address of Kernarg
4383                 (enable_sgpr_kernarg              segment. This is directly
4384                 _segment_ptr)                     copied from the
4385                                                   kernarg_address in the kernel
4386                                                   dispatch packet.
4387
4388                                                   Having CP load it once avoids
4389                                                   loading it at the beginning of
4390                                                   every wavefront.
4391      then       Dispatch Id                2      64-bit Dispatch ID of the
4392                 (enable_sgpr_dispatch_id)         dispatch packet being
4393                                                   executed.
4394      then       Flat Scratch Init          2      See
4395                 (enable_sgpr_flat_scratch         :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4396                 _init)
4397      then       Private Segment Size       1      The 32-bit byte size of a
4398                 (enable_sgpr_private              single work-item's memory
4399                 _segment_size)                    allocation. This is the
4400                                                   value from the kernel
4401                                                   dispatch packet Private
4402                                                   Segment Byte Size rounded up
4403                                                   by CP to a multiple of
4404                                                   DWORD.
4405
4406                                                   Having CP load it once avoids
4407                                                   loading it at the beginning of
4408                                                   every wavefront.
4409
4410                                                   This is not used for
4411                                                   GFX7-GFX8 since it is the same
4412                                                   value as the second SGPR of
4413                                                   Flat Scratch Init. However, it
4414                                                   may be needed for GFX9-GFX10 which
4415                                                   changes the meaning of the
4416                                                   Flat Scratch Init value.
4417      then       Work-Group Id X            1      32-bit work-group id in X
4418                 (enable_sgpr_workgroup_id         dimension of grid for
4419                 _X)                               wavefront.
4420      then       Work-Group Id Y            1      32-bit work-group id in Y
4421                 (enable_sgpr_workgroup_id         dimension of grid for
4422                 _Y)                               wavefront.
4423      then       Work-Group Id Z            1      32-bit work-group id in Z
4424                 (enable_sgpr_workgroup_id         dimension of grid for
4425                 _Z)                               wavefront.
4426      then       Work-Group Info            1      {first_wavefront, 14'b0000,
4427                 (enable_sgpr_workgroup            ordered_append_term[10:0],
4428                 _info)                            threadgroup_size_in_wavefronts[5:0]}
4429      then       Scratch Wavefront Offset   1      See
4430                 (enable_sgpr_private              :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`
4431                 _segment_wavefront_offset)        and
4432                                                   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4433      ========== ========================== ====== ==============================
4434
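The dense numbering rule for the SGPR set-up order can be illustrated with a
short Python sketch. The table contents are copied from the table above. The
function name and the representation of the enable fields as a set of strings
are assumptions made for the example. The sketch follows only the rule stated
in this section (enabled registers are numbered densely from SGPR0 in table
order) and does not model any additional hardware constraints.

```python
# (name, kernel descriptor enable field, number of SGPRs), in set-up order.
SGPR_SETUP_ORDER = [
    ("Private Segment Buffer", "enable_sgpr_private_segment_buffer", 4),
    ("Dispatch Ptr", "enable_sgpr_dispatch_ptr", 2),
    ("Queue Ptr", "enable_sgpr_queue_ptr", 2),
    ("Kernarg Segment Ptr", "enable_sgpr_kernarg_segment_ptr", 2),
    ("Dispatch Id", "enable_sgpr_dispatch_id", 2),
    ("Flat Scratch Init", "enable_sgpr_flat_scratch_init", 2),
    ("Private Segment Size", "enable_sgpr_private_segment_size", 1),
    ("Work-Group Id X", "enable_sgpr_workgroup_id_X", 1),
    ("Work-Group Id Y", "enable_sgpr_workgroup_id_Y", 1),
    ("Work-Group Id Z", "enable_sgpr_workgroup_id_Z", 1),
    ("Work-Group Info", "enable_sgpr_workgroup_info", 1),
    ("Scratch Wavefront Offset",
     "enable_sgpr_private_segment_wavefront_offset", 1),
]


def assign_sgprs(enabled_fields):
    """Map each enabled register name to its list of SGPR numbers.

    enabled_fields is a set of enable field names (hypothetical
    representation). Disabled registers get no SGPR number.
    """
    assignment = {}
    next_sgpr = 0
    for name, field, count in SGPR_SETUP_ORDER:
        if field in enabled_fields:
            assignment[name] = list(range(next_sgpr, next_sgpr + count))
            next_sgpr += count
    return assignment
```

For instance, enabling only the Private Segment Buffer, the Kernarg Segment Ptr
and Work-Group Id X places them in SGPR0-3, SGPR4-5 and SGPR6 respectively.
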
4435 The order of the VGPR registers is defined, but the compiler can specify which
4436 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4437 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4438 for enabled registers are dense starting at VGPR0: the first enabled register is
4439 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4440 VGPR number.
4441
4442 There are different methods used for the VGPR initial state:
4443
4444 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4445   specifies otherwise, a separate VGPR register is used per work-item ID. The
4446   VGPR register initial state for this method is defined in
4447   :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4448 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4449   specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4450   for all work-item IDs. The register layout for this method is defined in
4451   :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4452
4453   .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4454      :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4455
4456      ========== ========================== ====== ==============================
4457      VGPR Order Name                       Number Description
4458                 (kernel descriptor enable  of
4459                 field)                     VGPRs
4460      ========== ========================== ====== ==============================
4461      First      Work-Item Id X             1      32-bit work-item id in X
4462                 (Always initialized)              dimension of work-group for
4463                                                   wavefront lane.
4464      then       Work-Item Id Y             1      32-bit work-item id in Y
4465                 (enable_vgpr_workitem_id          dimension of work-group for
4466                 > 0)                              wavefront lane.
4467      then       Work-Item Id Z             1      32-bit work-item id in Z
4468                 (enable_vgpr_workitem_id          dimension of work-group for
4469                 > 1)                              wavefront lane.
4470      ========== ========================== ====== ==============================
4471
4472 ..
4473
4474   .. table:: Register Layout for Packed Work-Item ID Method
4475      :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4476
4477      ======= ======= ================ =========================================
4478      Bits    Size    Field Name       Description
4479      ======= ======= ================ =========================================
4480      0:9     10 bits Work-Item Id X   Work-item id in X
4481                                       dimension of work-group for
4482                                       wavefront lane.
4483
4484                                       Always initialized.
4485
4486      10:19   10 bits Work-Item Id Y   Work-item id in Y
4487                                       dimension of work-group for
4488                                       wavefront lane.
4489
4490                                       Initialized if enable_vgpr_workitem_id >
4491                                       0, otherwise set to 0.
4492      20:29   10 bits Work-Item Id Z   Work-item id in Z
4493                                       dimension of work-group for
4494                                       wavefront lane.
4495
4496                                       Initialized if enable_vgpr_workitem_id >
4497                                       1, otherwise set to 0.
4498      30:31   2 bits                   Reserved, set to 0.
4499      ======= ======= ================ =========================================
4500
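For the packed work-item ID method, the layout in the table above amounts to
three 10-bit fields in VGPR0. The following Python sketch shows how the IDs
could be packed and unpacked. The function names are hypothetical, and
``pack_work_item_ids`` only models the initialization rule from the table (Y
and Z are set to 0 unless enabled); it is not a description of how the hardware
is implemented.

```python
ID_MASK = 0x3FF  # each work-item ID occupies 10 bits


def pack_work_item_ids(x, y, z, enable_vgpr_workitem_id):
    """Build the initial VGPR0 value for one lane (illustrative model)."""
    for value in (x, y, z):
        if not 0 <= value <= ID_MASK:
            raise ValueError("work-item IDs must fit in 10 bits")
    packed = x                      # X is always initialized
    if enable_vgpr_workitem_id > 0:
        packed |= y << 10           # Y only if enable_vgpr_workitem_id > 0
    if enable_vgpr_workitem_id > 1:
        packed |= z << 20           # Z only if enable_vgpr_workitem_id > 1
    return packed                   # the two reserved top bits stay 0


def unpack_work_item_ids(vgpr0):
    """Extract (x, y, z) from a packed VGPR0 value."""
    x = vgpr0 & ID_MASK
    y = (vgpr0 >> 10) & ID_MASK
    z = (vgpr0 >> 20) & ID_MASK
    return x, y, z
```

A kernel prolog on such a target would perform the equivalent of
``unpack_work_item_ids`` with shift and mask instructions.
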
4501 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4502
4503 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4504    registers.
4505 2. Work-group Id registers X, Y, Z are set by ADC which supports any
4506    combination including none.
4507 3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis which is why
4508    its value cannot be included with the flat scratch init value which is per
4509    queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4510 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4511    or (X, Y, Z).
4512 5. Flat Scratch register pair initialization is described in
4513    :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4514
4515 The global segment can be accessed either using buffer instructions (GFX6 which
4516 has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
4517 instructions (GFX9-GFX10).
4518
4519 If buffer operations are used, then the compiler can generate a V# with the
4520 following properties:
4521
4522 * base address of 0
4523 * no swizzle
4524 * ATC: 1 if IOMMU present (such as APU)
4525 * ptr64: 1
4526 * MTYPE set to support memory coherence that matches the runtime (such as CC for
4527   APU and NC for dGPU).
4528
4529 .. _amdgpu-amdhsa-kernel-prolog:
4530
4531 Kernel Prolog
4532 ~~~~~~~~~~~~~
4533
4534 The compiler performs initialization in the kernel prologue depending on the
4535 target and information about things like stack usage in the kernel and called
4536 functions. Some of this initialization requires the compiler to request certain
4537 User and System SGPRs be present in the
4538 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4539 :ref:`amdgpu-amdhsa-kernel-descriptor`.
4540
4541 .. _amdgpu-amdhsa-kernel-prolog-cfi:
4542
4543 CFI
4544 +++
4545
4546 1.  The CFI return address is undefined.
4547
4548 2.  The CFI CFA is defined using an expression which evaluates to a location
4549     description that comprises one memory location description for the
4550     ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4551
4552 .. _amdgpu-amdhsa-kernel-prolog-m0:
4553
4554 M0
4555 ++
4556
4557 GFX6-GFX8
4558   The M0 register must be initialized with a value at least the total LDS size
4559   if the kernel may access LDS via DS or flat operations. The total LDS size is
4560   available in the dispatch packet. For M0, it is also possible to use the
4561   maximum possible value of LDS for the given target (0x7FFF for GFX6 and
4562   0xFFFF for GFX7-GFX8).
4563 GFX9-GFX10
4564   The M0 register is not used for range checking LDS accesses and so does not
4565   need to be initialized in the prolog.
4566
4567 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4568
4569 Stack Pointer
4570 +++++++++++++
4571
4572 If the kernel has function calls it must set up the ABI stack pointer described
4573 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4574 SGPR32 to the unswizzled scratch offset of the address past the last local
4575 allocation.
4576
4577 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4578
4579 Frame Pointer
4580 +++++++++++++
4581
4582 If the kernel needs a frame pointer for the reasons defined in
4583 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4584 kernel prolog. If a frame pointer is not required then all uses of the frame
4585 pointer are replaced with immediate ``0`` offsets.
4586
4587 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4588
4589 Flat Scratch
4590 ++++++++++++
4591
4592 There are different methods used for initializing flat scratch:
4593
4594 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4595   specifies *Does not support generic address space*:
4596
4597   Flat scratch is not supported and there is no flat scratch register pair.
4598
4599 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4600   specifies *Offset flat scratch*:
4601
4602   If the kernel or any function it calls may use flat operations to access
4603   scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4604   (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4605   Scratch Wavefront Offset SGPR registers (see
4606   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4607
4608   1. The low word of Flat Scratch Init is the 32-bit byte offset from
4609      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4610      being managed by SPI for the queue executing the kernel dispatch. This is
4611      the same value used in the Scratch Segment Buffer V# base address.
4612
4613      CP obtains this from the runtime. (The Scratch Segment Buffer base address
4614      is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4615
4616      The prolog must add the value of Scratch Wavefront Offset to get the
4617      wavefront's byte scratch backing memory offset from
4618      ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4619
4620      The Scratch Wavefront Offset must also be used as an offset with Private
4621      segment address when using the Scratch Segment Buffer.
4622
4623      Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
4624      shifted by 8 before moving into FLAT_SCRATCH_HI.
4625
4626      FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4627      SGPRn is the highest numbered SGPR allocated to the wavefront).
4628      FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4629      added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4630      FLAT SCRATCH BASE in flat memory instructions that access the scratch
4631      aperture.
4632   2. The second word of Flat Scratch Init is the 32-bit byte size of a single
4633      work-item's scratch memory usage.
4634
4635      CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4636      checks that the value in the kernel dispatch packet Private Segment Byte
4637      Size is not larger and requests the runtime to increase the queue's scratch
4638      size if necessary.
4639
4640      CP directly loads from the kernel dispatch packet Private Segment Byte Size
4641      field and rounds up to a multiple of DWORD. Having CP load it once avoids
4642      loading it at the beginning of every wavefront.
4643
4644      The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4645      GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4646      in flat memory instructions.
4647
4648 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4649   specifies *Absolute flat scratch*:
4650
4651   If the kernel or any function it calls may use flat operations to access
4652   scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4653   (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4654   uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4655   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4656
4657   The Flat Scratch Init is the 64-bit address of the base of scratch backing
4658   memory being managed by SPI for the queue executing the kernel dispatch.
4659
4660   CP obtains this from the runtime.
4661
4662   The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4663   and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4664   which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4665   memory instructions.
4666
4667   The Scratch Wavefront Offset must also be used as an offset with Private
4668   segment address when using the Scratch Segment Buffer (see
4669   :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4670
4671 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4672   specifies *Architected flat scratch*:
4673
4674   If ENABLE_PRIVATE_SEGMENT is enabled in
4675   :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
4676   register pair will be initialized to the 64-bit address of the base of scratch
4677   backing memory being managed by SPI for the queue executing the kernel
4678   dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4679   flat scratch base in flat memory instructions.
4680
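The arithmetic performed by the prolog for the *Offset flat scratch* and
*Absolute flat scratch* methods can be sketched in Python as shown here. This is
a model of the computation only, not of the emitted instructions. The function
names are hypothetical, and wrapping the sums to 32 and 64 bits is an
assumption about how the additions behave. The *Architected flat scratch*
method needs no prolog code and is therefore not modeled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for offset flat scratch.

    init_lo: byte offset of the queue's scratch backing memory.
    init_hi: byte size of a single work-item's scratch memory.
    """
    # Wavefront byte offset; 32-bit wraparound is an assumption.
    byte_offset = (init_lo + scratch_wavefront_offset) & MASK32
    flat_scratch_hi = byte_offset >> 8   # stored in units of 256 bytes
    flat_scratch_lo = init_hi            # used as the flat scratch size
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for absolute flat scratch.

    init_lo/init_hi: low and high words of the 64-bit scratch base address.
    """
    base = ((init_hi << 32) | init_lo) + scratch_wavefront_offset
    base &= MASK64                       # 64-bit wraparound is an assumption
    return base & MASK32, base >> 32
```

In the absolute case the addition has to carry from the low word into the high
word, which is why the sketch forms the full 64-bit value before splitting it.
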
4681 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4682
4683 Private Segment Buffer
4684 ++++++++++++++++++++++
4685
4686 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4687 *Architected flat scratch* then a Private Segment Buffer is not supported.
4688 Instead the flat SCRATCH instructions are used.
4689
4690 Otherwise, the Private Segment Buffer SGPR register is used to initialize 4 SGPRs
4691 that are used as a V# to access scratch. CP uses the value provided by the
4692 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
4693 access the private memory space using a segment address. See
4694 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4695
4696 The scratch V# is a four-aligned SGPR and always selected for the kernel as
4697 follows:
4698
4699   - If it is known during instruction selection that there is stack usage,
4700     SGPR0-3 is reserved for use as the scratch V#.  Stack usage is assumed if
4701     optimizations are disabled (``-O0``), if stack objects already exist (for
4702     locals, etc.), or if there are any function calls.
4703
4704   - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4705     are reserved for the tentative scratch V#. These will be used if it is
4706     determined that spilling is needed.
4707
4708     - If no use is made of the tentative scratch V#, then it is unreserved,
4709       and the register count is determined ignoring it.
4710     - If use is made of the tentative scratch V#, then its register numbers
4711       are shifted to the first four-aligned SGPR index after the highest one
4712       allocated by the register allocator, and all uses are updated. The
4713       register count includes them in the shifted location.
4714     - In either case, if the processor has the SGPR allocation bug, the
4715       tentative allocation is not shifted or unreserved in order to ensure
4716       the register count is higher to work around the bug.
4717
4718     .. note::
4719
4720       This approach of using a tentative scratch V# and shifting the register
4721       numbers if used avoids having to perform register allocation a second
4722       time if the tentative V# is eliminated. This is more efficient and
4723       avoids the problem that the second register allocation may perform
4724       spilling which will fail as there is no longer a scratch V#.
4725
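The shifting of a used tentative scratch V# to "the first four-aligned SGPR
index after the highest one allocated" is an align-up computation. A minimal
Python sketch, with a hypothetical function name and ignoring the SGPR
allocation bug case described above:

```python
def shifted_scratch_vsharp(highest_allocated_sgpr):
    """Return the four SGPR numbers for the shifted scratch V#.

    highest_allocated_sgpr is the highest SGPR index chosen by the
    register allocator (illustrative input, not an LLVM API value).
    """
    # Round (highest + 1) up to the next multiple of 4.
    first = (highest_allocated_sgpr + 1 + 3) & ~3
    return list(range(first, first + 4))
```

For example, if the highest allocated register is SGPR5, the V# is placed in
SGPR8-SGPR11, and the register count then includes those four registers.
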
4726 When the kernel prolog code is being emitted it is known whether the scratch V#
4727 described above is actually used. If it is, the prolog code must set it up by
4728 copying the Private Segment Buffer to the scratch V# registers and then adding
4729 the Private Segment Wavefront Offset to the queue base address in the V#. The
4730 result is a V# with a base address pointing to the beginning of the wavefront
4731 scratch backing memory.
4732
4733 The Private Segment Buffer is always requested, but the Private Segment
4734 Wavefront Offset is only requested if it is used (see
4735 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4736
4737 .. _amdgpu-amdhsa-memory-model:
4738
4739 Memory Model
4740 ~~~~~~~~~~~~
4741
4742 This section describes the mapping of the LLVM memory model onto AMDGPU machine
4743 code (see :ref:`memmodel`).
4744
4745 The AMDGPU backend supports the memory synchronization scopes specified in
4746 :ref:`amdgpu-memory-scopes`.
4747
4748 The code sequences used to implement the memory model specify the order of
4749 instructions that a single thread must execute. The ``s_waitcnt`` and cache
4750 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
4751 to other memory instructions executed by the same thread. This allows them to be
4752 moved earlier or later which can allow them to be combined with other instances
4753 of the same instruction, or hoisted/sunk out of loops to improve performance.
4754 Only the instructions related to the memory model are given; additional
4755 ``s_waitcnt`` instructions are required to ensure registers are defined before
4756 being used. These may be able to be combined with the memory model ``s_waitcnt``
4757 instructions as described above.
4758
4759 The AMDGPU backend supports the following memory models:
4760
4761   HSA Memory Model [HSA]_
4762     The HSA memory model uses a single happens-before relation for all address
4763     spaces (see :ref:`amdgpu-address-spaces`).
4764   OpenCL Memory Model [OpenCL]_
4765     The OpenCL memory model has separate happens-before relations for the
4766     global and local address spaces. Only a fence specifying both the global
4767     and local address spaces, and seq_cst instructions, join the relations.
4768     Since the LLVM ``fence`` instruction does not allow an address space to be
4769     specified, the OpenCL fence has to conservatively assume that both the
4770     local and global address spaces were specified. However, optimizations can
4771     often eliminate the additional ``s_waitcnt`` instructions when there are
4772     no intervening memory instructions which access the corresponding address
4773     space. The code sequences in the table indicate what can be omitted for the
4774     OpenCL memory model. The target triple environment is used to determine if
4775     the source language is OpenCL (see :ref:`amdgpu-opencl`).
4776
4777 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
4778 operations.
4779
4780 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
4781 termed vector memory operations.
4782
4783 Private address space uses ``buffer_load/store`` using the scratch V#
4784 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
4785 is accessing the memory, atomic memory orderings are not meaningful, and all
4786 accesses are treated as non-atomic.
4787
4788 Constant address space uses ``buffer/global_load`` instructions (or equivalent
4789 scalar memory instructions). Since the constant address space contents do not
4790 change during the execution of a kernel dispatch, it is not legal to perform
4791 stores, atomic memory orderings are not meaningful, and all accesses are
4792 treated as non-atomic.
4793
4794 A memory synchronization scope wider than work-group is not meaningful for the
4795 group (LDS) address space and is treated as work-group.
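
As a minimal sketch of the scope rule above, the following LLVM IR requests
agent scope on an atomic operation in the local (LDS) address space. The
function and value names are hypothetical, and the opaque ``ptr`` spelling
assumes a recent LLVM release (older releases write ``i32 addrspace(3)*``).

.. code-block:: llvm

  ; Hypothetical example; names are illustrative only.
  define i32 @lds_counter(ptr addrspace(3) %counter) {
    ; addrspace(3) is the local (LDS) address space. Agent scope is wider
    ; than work-group, so this operation is treated as if it had been
    ; written with syncscope("workgroup").
    %old = atomicrmw add ptr addrspace(3) %counter, i32 1 syncscope("agent") monotonic
    ret i32 %old
  }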
4796
4797 The memory model does not support the region address space which is treated as
4798 non-atomic.
4799
4800 Acquire memory ordering is not meaningful on store atomic instructions and is
4801 treated as non-atomic.
4802
4803 Release memory ordering is not meaningful on load atomic instructions and is
4804 treated as non-atomic.
4805
4806 Acquire-release memory ordering is not meaningful on load or store atomic
4807 instructions and is treated as acquire and release respectively.
4808
4809 The memory order also adds the single thread optimization constraints defined in
4810 table
4811 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
4812
4813   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
4814      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
4815
4816      ============ ==============================================================
4817      LLVM Memory  Optimization Constraints
4818      Ordering
4819      ============ ==============================================================
4820      unordered    *none*
4821      monotonic    *none*
4822      acquire      - If a load atomic/atomicrmw then no following load/load
4823                     atomic/store/store atomic/atomicrmw/fence instruction can be
4824                     moved before the acquire.
4825                   - If a fence then same as load atomic, plus no preceding
4826                     associated fence-paired-atomic can be moved after the fence.
4827      release      - If a store atomic/atomicrmw then no preceding load/load
4828                     atomic/store/store atomic/atomicrmw/fence instruction can be
4829                     moved after the release.
4830                   - If a fence then same as store atomic, plus no following
4831                     associated fence-paired-atomic can be moved before the
4832                     fence.
4833      acq_rel      Same constraints as both acquire and release.
4834      seq_cst      - If a load atomic then same constraints as acquire, plus no
4835                     preceding sequentially consistent load atomic/store
4836                     atomic/atomicrmw/fence instruction can be moved after the
4837                     seq_cst.
4838                   - If a store atomic then the same constraints as release, plus
4839                     no following sequentially consistent load atomic/store
4840                     atomic/atomicrmw/fence instruction can be moved before the
4841                     seq_cst.
4842                   - If an atomicrmw/fence then same constraints as acq_rel.
4843      ============ ==============================================================
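
The acquire row of the table can be illustrated with a small LLVM IR sketch.
It is not compiler output, and the function and value names are hypothetical.
The opaque ``ptr`` spelling assumes a recent LLVM release.

.. code-block:: llvm

  ; Hypothetical consumer; names are illustrative only.
  define i32 @consume(ptr addrspace(1) %flag, ptr addrspace(1) %data) {
    ; Acquire load atomic of the flag.
    %f = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
    ; This following non-atomic load must not be moved before the acquire
    ; load above, even though the two addresses are unrelated.
    %d = load i32, ptr addrspace(1) %data, align 4
    %r = add i32 %f, %d
    ret i32 %r
  }

With ``monotonic`` in place of ``acquire``, the table imposes no such
constraint on the second load.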
4844
4845 The code sequences used to implement the memory model are defined in the
4846 following sections:
4847
4848 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
4849 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
4850 * :ref:`amdgpu-amdhsa-memory-model-gfx10`
4851
4852 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
4853
4854 Memory Model GFX6-GFX9
4855 ++++++++++++++++++++++
4856
4857 For GFX6-GFX9:
4858
4859 * Each agent has multiple shader arrays (SA).
4860 * Each SA has multiple compute units (CU).
4861 * Each CU has multiple SIMDs that execute wavefronts.
4862 * The wavefronts for a single work-group are executed in the same CU but may be
4863   executed by different SIMDs.
4864 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
4865   executing on it.
4866 * All LDS operations of a CU are performed as wavefront wide operations in a
4867   global order and involve no caching. Completion is reported to a wavefront in
4868   execution order.
4869 * The LDS memory has multiple request queues shared by the SIMDs of a
4870   CU. Therefore, the LDS operations performed by different wavefronts of a
4871   work-group can be reordered relative to each other, which can result in
4872   reordering the visibility of vector memory operations with respect to LDS
4873   operations of other wavefronts in the same work-group. A ``s_waitcnt
4874   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
4875   vector memory operations between wavefronts of a work-group, but not between
4876   operations performed by the same wavefront.
4877 * The vector memory operations are performed as wavefront wide operations and
4878   completion is reported to a wavefront in execution order. The exception is
4879   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
4880   vector memory order if they access LDS memory, and out of LDS operation order
4881   if they access global memory.
4882 * The vector memory operations access a single vector L1 cache shared by all
4883   SIMDs of a CU. Therefore, no special action is required for coherence between the
4884   lanes of a single wavefront, or for coherence between wavefronts in the same
4885   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
4886   wavefronts executing in different work-groups as they may be executing on
4887   different CUs.
4888 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
4889   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
4890   scalar operations are used in a restricted way so do not impact the memory
4891   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
4892 * The vector and scalar memory operations use an L2 cache shared by all CUs on
4893   the same agent.
4894 * The L2 cache has independent channels to service disjoint ranges of virtual
4895   addresses.
4896 * Each CU has a separate request queue per channel. Therefore, the vector and
4897   scalar memory operations performed by wavefronts executing in different
4898   work-groups (which may be executing on different CUs) of an agent can be
4899   reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
4900   ensure synchronization between vector memory operations of different CUs. It
4901   ensures a previous vector memory operation has completed before executing a
4902   subsequent vector memory or LDS operation and so can be used to meet the
4903   requirements of acquire and release.
4904 * The L2 cache can be kept coherent with other agents on some targets, or ranges
4905   of virtual addresses can be set up to bypass it to ensure system coherence.
4906
4907 Scalar memory operations are only used to access memory that is proven to not
4908 change during the execution of the kernel dispatch. This includes constant
4909 address space and global address space for program scope ``const`` variables.
4910 Therefore, the kernel machine code does not have to maintain the scalar cache to
4911 ensure it is coherent with the vector caches. The scalar and vector caches are
4912 invalidated between kernel dispatches by CP since constant address space data
4913 may change between kernel dispatch executions. See
4914 :ref:`amdgpu-amdhsa-memory-spaces`.
4915
4916 The one exception is if scalar writes are used to spill SGPR registers. In this
4917 case the AMDGPU backend ensures the memory location used to spill is never
4918 accessed by vector memory operations at the same time. If scalar writes are used
4919 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
4920 return since the locations may be used for vector memory instructions by a
4921 future wavefront that uses the same scratch area, or a function call that
4922 creates a frame at the same address, respectively. There is no need for a
4923 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
4924
4925 For kernarg backing memory:
4926
4927 * CP invalidates the L1 cache at the start of each kernel dispatch.
4928 * On dGPU the kernarg backing memory is allocated in host memory accessed as
4929   MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
4930   causes it to be treated as non-volatile and so is not invalidated by
4931   ``*_vol``.
4932 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
4933   and so the L2 cache will be coherent with the CPU and other agents.
4934
4935 Scratch backing memory (which is used for the private address space) is accessed
4936 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
4937 only accessed by a single thread, and is always write-before-read, there is
4938 never a need to invalidate these entries from the L1 cache. Hence all cache
4939 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
4940
4941 The code sequences used to implement the memory model for GFX6-GFX9 are defined
4942 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
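
As one way to read the table, each row is selected by the LLVM instruction,
its memory ordering, its synchronization scope, and the address space of its
pointer operand. The sketch below annotates a single IR instruction with the
steps listed in its matching row. The function name is hypothetical, the
opaque ``ptr`` spelling assumes a recent LLVM release, and the comments
restate the table rather than show actual compiler output.

.. code-block:: llvm

  ; Hypothetical example; names are illustrative only.
  define i32 @acquire_flag(ptr addrspace(1) %flag) {
    ; Row: load atomic / acquire / agent / global
    ;   1. buffer/global_load glc=1
    ;   2. s_waitcnt vmcnt(0)    (the load completes before the invalidate)
    ;   3. buffer_wbinvl1_vol    (following loads do not see stale data)
    %v = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
    ret i32 %v
  }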
4943
4944   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
4945      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
4946
4947      ============ ============ ============== ========== ================================
4948      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
4949                   Ordering     Sync Scope     Address    GFX6-GFX9
4950                                               Space
4951      ============ ============ ============== ========== ================================
4952      **Non-Atomic**
4953      ------------------------------------------------------------------------------------
4954      load         *none*       *none*         - global   - !volatile & !nontemporal
4955                                               - generic
4956                                               - private    1. buffer/global/flat_load
4957                                               - constant
4958                                                          - !volatile & nontemporal
4959
4960                                                            1. buffer/global/flat_load
4961                                                               glc=1 slc=1
4962
4963                                                          - volatile
4964
4965                                                            1. buffer/global/flat_load
4966                                                               glc=1
4967                                                            2. s_waitcnt vmcnt(0)
4968
4969                                                             - Must happen before
4970                                                               any following volatile
4971                                                               global/generic
4972                                                               load/store.
4973                                                             - Ensures that
4974                                                               volatile
4975                                                               operations to
4976                                                               different
4977                                                               addresses will not
4978                                                               be reordered by
4979                                                               hardware.
4980
4981      load         *none*       *none*         - local    1. ds_load
4982      store        *none*       *none*         - global   - !volatile & !nontemporal
4983                                               - generic
4984                                               - private    1. buffer/global/flat_store
4985                                               - constant
4986                                                          - !volatile & nontemporal
4987
4988                                                            1. buffer/global/flat_store
4989                                                               glc=1 slc=1
4990
4991                                                          - volatile
4992
4993                                                            1. buffer/global/flat_store
4994                                                            2. s_waitcnt vmcnt(0)
4995
4996                                                             - Must happen before
4997                                                               any following volatile
4998                                                               global/generic
4999                                                               load/store.
5000                                                             - Ensures that
5001                                                               volatile
5002                                                               operations to
5003                                                               different
5004                                                               addresses will not
5005                                                               be reordered by
5006                                                               hardware.
5007
5008      store        *none*       *none*         - local    1. ds_store
5009      **Unordered Atomic**
5010      ------------------------------------------------------------------------------------
5011      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
5012      store atomic unordered    *any*          *any*      *Same as non-atomic*.
5013      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
5014      **Monotonic Atomic**
5015      ------------------------------------------------------------------------------------
5016      load atomic  monotonic    - singlethread - global   1. buffer/global/ds/flat_load
5017                                - wavefront    - local
5018                                - workgroup    - generic
5019      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
5020                                - system       - generic     glc=1
5021      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
5022                                - wavefront    - generic
5023                                - workgroup
5024                                - agent
5025                                - system
5026      store atomic monotonic    - singlethread - local    1. ds_store
5027                                - wavefront
5028                                - workgroup
5029      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
5030                                - wavefront    - generic
5031                                - workgroup
5032                                - agent
5033                                - system
5034      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
5035                                - wavefront
5036                                - workgroup
5037      **Acquire Atomic**
5038      ------------------------------------------------------------------------------------
5039      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
5040                                - wavefront    - local
5041                                               - generic
5042      load atomic  acquire      - workgroup    - global   1. buffer/global_load
5043      load atomic  acquire      - workgroup    - local    1. ds/flat_load
5044                                               - generic  2. s_waitcnt lgkmcnt(0)
5045
5046                                                            - If OpenCL, omit.
5047                                                            - Must happen before
5048                                                              any following
5049                                                              global/generic
5050                                                              load/load
5051                                                              atomic/store/store
5052                                                              atomic/atomicrmw.
5053                                                            - Ensures any
5054                                                              following global
5055                                                              data read is no
5056                                                              older than a local load
5057                                                              atomic value being
5058                                                              acquired.
5059
5060      load atomic  acquire      - agent        - global   1. buffer/global_load
5061                                - system                     glc=1
5062                                                          2. s_waitcnt vmcnt(0)
5063
5064                                                            - Must happen before
5065                                                              following
5066                                                              buffer_wbinvl1_vol.
5067                                                            - Ensures the load
5068                                                              has completed
5069                                                              before invalidating
5070                                                              the cache.
5071
5072                                                          3. buffer_wbinvl1_vol
5073
5074                                                            - Must happen before
5075                                                              any following
5076                                                              global/generic
5077                                                              load/load
5078                                                              atomic/atomicrmw.
5079                                                            - Ensures that
5080                                                              following
5081                                                              loads will not see
5082                                                              stale global data.
5083
5084      load atomic  acquire      - agent        - generic  1. flat_load glc=1
5085                                - system                  2. s_waitcnt vmcnt(0) &
5086                                                             lgkmcnt(0)
5087
5088                                                            - If OpenCL, omit
5089                                                              lgkmcnt(0).
5090                                                            - Must happen before
5091                                                              following
5092                                                              buffer_wbinvl1_vol.
5093                                                            - Ensures the flat_load
5094                                                              has completed
5095                                                              before invalidating
5096                                                              the cache.
5097
5098                                                          3. buffer_wbinvl1_vol
5099
5100                                                            - Must happen before
5101                                                              any following
5102                                                              global/generic
5103                                                              load/load
5104                                                              atomic/atomicrmw.
5105                                                            - Ensures that
5106                                                              following loads
5107                                                              will not see stale
5108                                                              global data.
5109
5110      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
5111                                - wavefront    - local
5112                                               - generic
5113      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
5114      atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
5115                                               - generic  2. s_waitcnt lgkmcnt(0)
5116
5117                                                            - If OpenCL, omit.
5118                                                            - Must happen before
5119                                                              any following
5120                                                              global/generic
5121                                                              load/load
5122                                                              atomic/store/store
5123                                                              atomic/atomicrmw.
5124                                                            - Ensures any
5125                                                              following global
5126                                                              data read is no
5127                                                              older than a local
5128                                                              atomicrmw value
5129                                                              being acquired.
5130
5131      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
5132                                - system                  2. s_waitcnt vmcnt(0)
5133
5134                                                            - Must happen before
5135                                                              following
5136                                                              buffer_wbinvl1_vol.
5137                                                            - Ensures the
5138                                                              atomicrmw has
5139                                                              completed before
5140                                                              invalidating the
5141                                                              cache.
5142
5143                                                          3. buffer_wbinvl1_vol
5144
5145                                                            - Must happen before
5146                                                              any following
5147                                                              global/generic
5148                                                              load/load
5149                                                              atomic/atomicrmw.
5150                                                            - Ensures that
5151                                                              following loads
5152                                                              will not see stale
5153                                                              global data.
5154
5155      atomicrmw    acquire      - agent        - generic  1. flat_atomic
5156                                - system                  2. s_waitcnt vmcnt(0) &
5157                                                             lgkmcnt(0)
5158
5159                                                            - If OpenCL, omit
5160                                                              lgkmcnt(0).
5161                                                            - Must happen before
5162                                                              following
5163                                                              buffer_wbinvl1_vol.
5164                                                            - Ensures the
5165                                                              atomicrmw has
5166                                                              completed before
5167                                                              invalidating the
5168                                                              cache.
5169
5170                                                          3. buffer_wbinvl1_vol
5171
5172                                                            - Must happen before
5173                                                              any following
5174                                                              global/generic
5175                                                              load/load
5176                                                              atomic/atomicrmw.
5177                                                            - Ensures that
5178                                                              following loads
5179                                                              will not see stale
5180                                                              global data.
5181
5182      fence        acquire      - singlethread *none*     *none*
5183                                - wavefront
5184      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5185
5186                                                            - If OpenCL and
5187                                                              address space is
5188                                                              not generic, omit.
5189                                                            - However, since LLVM
5190                                                              currently has no
5191                                                              address space on
5192                                                              the fence need to
5193                                                              conservatively
5194                                                              always generate. If
5195                                                              fence had an
5196                                                              address space then
5197                                                              set to address
5198                                                              space of OpenCL
5199                                                              fence flag, or to
5200                                                              generic if both
5201                                                              local and global
5202                                                              flags are
5203                                                              specified.
5204                                                            - Must happen after
5205                                                              any preceding
5206                                                              local/generic load
5207                                                              atomic/atomicrmw
5208                                                              with an equal or
5209                                                              wider sync scope
5210                                                              and memory ordering
5211                                                              stronger than
5212                                                              unordered (this is
5213                                                              termed the
5214                                                              fence-paired-atomic).
5215                                                            - Must happen before
5216                                                              any following
5217                                                              global/generic
5218                                                              load/load
5219                                                              atomic/store/store
5220                                                              atomic/atomicrmw.
5221                                                            - Ensures any
5222                                                              following global
5223                                                              data read is no
5224                                                              older than the
5225                                                              value read by the
5226                                                              fence-paired-atomic.
5227
5228      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5229                                - system                     vmcnt(0)
5230
5231                                                            - If OpenCL and
5232                                                              address space is
5233                                                              not generic, omit
5234                                                              lgkmcnt(0).
5235                                                            - However, since LLVM
5236                                                              currently has no
5237                                                              address space on
5238                                                              the fence, lgkmcnt(0)
5239                                                              must conservatively
5240                                                              always be generated
5241                                                              (see comment for
5242                                                              previous fence).
5243                                                            - Could be split into
5244                                                              separate s_waitcnt
5245                                                              vmcnt(0) and
5246                                                              s_waitcnt
5247                                                              lgkmcnt(0) to allow
5248                                                              them to be
5249                                                              independently moved
5250                                                              according to the
5251                                                              following rules.
5252                                                            - s_waitcnt vmcnt(0)
5253                                                              must happen after
5254                                                              any preceding
5255                                                              global/generic load
5256                                                              atomic/atomicrmw
5257                                                              with an equal or
5258                                                              wider sync scope
5259                                                              and memory ordering
5260                                                              stronger than
5261                                                              unordered (this is
5262                                                              termed the
5263                                                              fence-paired-atomic).
5264                                                            - s_waitcnt lgkmcnt(0)
5265                                                              must happen after
5266                                                              any preceding
5267                                                              local/generic load
5268                                                              atomic/atomicrmw
5269                                                              with an equal or
5270                                                              wider sync scope
5271                                                              and memory ordering
5272                                                              stronger than
5273                                                              unordered (this is
5274                                                              termed the
5275                                                              fence-paired-atomic).
5276                                                            - Must happen before
5277                                                              the following
5278                                                              buffer_wbinvl1_vol.
5279                                                            - Ensures that the
5280                                                              fence-paired-atomic
5281                                                              has completed
5282                                                              before invalidating
5283                                                              the
5284                                                              cache. Therefore
5285                                                              any following
5286                                                              locations read must
5287                                                              be no older than
5288                                                              the value read by
5289                                                              the
5290                                                              fence-paired-atomic.
5291
5292                                                          2. buffer_wbinvl1_vol
5293
5294                                                            - Must happen before any
5295                                                              following global/generic
5296                                                              load/load
5297                                                              atomic/store/store
5298                                                              atomic/atomicrmw.
5299                                                            - Ensures that
5300                                                              following loads
5301                                                              will not see stale
5302                                                              global data.
5303
5304      **Release Atomic**
5305      ------------------------------------------------------------------------------------
5306      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
5307                                - wavefront    - local
5308                                               - generic
5309      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5310                                               - generic
5311                                                            - If OpenCL, omit.
5312                                                            - Must happen after
5313                                                              any preceding
5314                                                              local/generic
5315                                                              load/store/load
5316                                                              atomic/store
5317                                                              atomic/atomicrmw.
5318                                                            - Must happen before
5319                                                              the following
5320                                                              store.
5321                                                            - Ensures that all
5322                                                              memory operations
5323                                                              to local have
5324                                                              completed before
5325                                                              performing the
5326                                                              store that is being
5327                                                              released.
5328
5329                                                          2. buffer/global/flat_store
5330      store atomic release      - workgroup    - local    1. ds_store
5331      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5332                                - system       - generic     vmcnt(0)
5333
5334                                                            - If OpenCL and
5335                                                              address space is
5336                                                              not generic, omit
5337                                                              lgkmcnt(0).
5338                                                            - Could be split into
5339                                                              separate s_waitcnt
5340                                                              vmcnt(0) and
5341                                                              s_waitcnt
5342                                                              lgkmcnt(0) to allow
5343                                                              them to be
5344                                                              independently moved
5345                                                              according to the
5346                                                              following rules.
5347                                                            - s_waitcnt vmcnt(0)
5348                                                              must happen after
5349                                                              any preceding
5350                                                              global/generic
5351                                                              load/store/load
5352                                                              atomic/store
5353                                                              atomic/atomicrmw.
5354                                                            - s_waitcnt lgkmcnt(0)
5355                                                              must happen after
5356                                                              any preceding
5357                                                              local/generic
5358                                                              load/store/load
5359                                                              atomic/store
5360                                                              atomic/atomicrmw.
5361                                                            - Must happen before
5362                                                              the following
5363                                                              store.
5364                                                            - Ensures that all
5365                                                              memory operations
5366                                                              to memory have
5367                                                              completed before
5368                                                              performing the
5369                                                              store that is being
5370                                                              released.
5371
5372                                                          2. buffer/global/flat_store
5373      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
5374                                - wavefront    - local
5375                                               - generic
5376      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5377                                               - generic
5378                                                            - If OpenCL, omit.
5379                                                            - Must happen after
5380                                                              any preceding
5381                                                              local/generic
5382                                                              load/store/load
5383                                                              atomic/store
5384                                                              atomic/atomicrmw.
5385                                                            - Must happen before
5386                                                              the following
5387                                                              atomicrmw.
5388                                                            - Ensures that all
5389                                                              memory operations
5390                                                              to local have
5391                                                              completed before
5392                                                              performing the
5393                                                              atomicrmw that is
5394                                                              being released.
5395
5396                                                          2. buffer/global/flat_atomic
5397      atomicrmw    release      - workgroup    - local    1. ds_atomic
5398      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5399                                - system       - generic     vmcnt(0)
5400
5401                                                            - If OpenCL, omit
5402                                                              lgkmcnt(0).
5403                                                            - Could be split into
5404                                                              separate s_waitcnt
5405                                                              vmcnt(0) and
5406                                                              s_waitcnt
5407                                                              lgkmcnt(0) to allow
5408                                                              them to be
5409                                                              independently moved
5410                                                              according to the
5411                                                              following rules.
5412                                                            - s_waitcnt vmcnt(0)
5413                                                              must happen after
5414                                                              any preceding
5415                                                              global/generic
5416                                                              load/store/load
5417                                                              atomic/store
5418                                                              atomic/atomicrmw.
5419                                                            - s_waitcnt lgkmcnt(0)
5420                                                              must happen after
5421                                                              any preceding
5422                                                              local/generic
5423                                                              load/store/load
5424                                                              atomic/store
5425                                                              atomic/atomicrmw.
5426                                                            - Must happen before
5427                                                              the following
5428                                                              atomicrmw.
5429                                                            - Ensures that all
5430                                                              memory operations
5431                                                              to global and local
5432                                                              have completed
5433                                                              before performing
5434                                                              the atomicrmw that
5435                                                              is being released.
5436
5437                                                          2. buffer/global/flat_atomic
5438      fence        release      - singlethread *none*     *none*
5439                                - wavefront
5440      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5441
5442                                                            - If OpenCL and
5443                                                              address space is
5444                                                              not generic, omit.
5445                                                            - However, since LLVM
5446                                                              currently has no
5447                                                              address space on
5448                                                              the fence, the wait
5449                                                              must conservatively
5450                                                              always be generated.
5451                                                              If the fence had an
5452                                                              address space, it
5453                                                              would be set to the
5454                                                              address space of the
5455                                                              OpenCL fence flag, or
5456                                                              to generic if both
5457                                                              local and global
5458                                                              flags are
5459                                                              specified.
5460                                                            - Must happen after
5461                                                              any preceding
5462                                                              local/generic
5463                                                              load/load
5464                                                              atomic/store/store
5465                                                              atomic/atomicrmw.
5466                                                            - Must happen before
5467                                                              any following store
5468                                                              atomic/atomicrmw
5469                                                              with an equal or
5470                                                              wider sync scope
5471                                                              and memory ordering
5472                                                              stronger than
5473                                                              unordered (this is
5474                                                              termed the
5475                                                              fence-paired-atomic).
5476                                                            - Ensures that all
5477                                                              memory operations
5478                                                              to local have
5479                                                              completed before
5480                                                              performing the
5481                                                              following
5482                                                              fence-paired-atomic.
5483
5484      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5485                                - system                     vmcnt(0)
5486
5487                                                            - If OpenCL and
5488                                                              address space is
5489                                                              not generic, omit
5490                                                              lgkmcnt(0).
5491                                                            - If OpenCL and
5492                                                              address space is
5493                                                              local, omit
5494                                                              vmcnt(0).
5495                                                            - However, since LLVM
5496                                                              currently has no
5497                                                              address space on
5498                                                              the fence, both waits
5499                                                              must conservatively
5500                                                              always be generated.
5501                                                              If the fence had an
5502                                                              address space, it
5503                                                              would be set to the
5504                                                              address space of the
5505                                                              OpenCL fence flag, or
5506                                                              to generic if both
5507                                                              local and global
5508                                                              flags are
5509                                                              specified.
5510                                                            - Could be split into
5511                                                              separate s_waitcnt
5512                                                              vmcnt(0) and
5513                                                              s_waitcnt
5514                                                              lgkmcnt(0) to allow
5515                                                              them to be
5516                                                              independently moved
5517                                                              according to the
5518                                                              following rules.
5519                                                            - s_waitcnt vmcnt(0)
5520                                                              must happen after
5521                                                              any preceding
5522                                                              global/generic
5523                                                              load/store/load
5524                                                              atomic/store
5525                                                              atomic/atomicrmw.
5526                                                            - s_waitcnt lgkmcnt(0)
5527                                                              must happen after
5528                                                              any preceding
5529                                                              local/generic
5530                                                              load/store/load
5531                                                              atomic/store
5532                                                              atomic/atomicrmw.
5533                                                            - Must happen before
5534                                                              any following store
5535                                                              atomic/atomicrmw
5536                                                              with an equal or
5537                                                              wider sync scope
5538                                                              and memory ordering
5539                                                              stronger than
5540                                                              unordered (this is
5541                                                              termed the
5542                                                              fence-paired-atomic).
5543                                                            - Ensures that all
5544                                                              memory operations
5545                                                              have
5546                                                              completed before
5547                                                              performing the
5548                                                              following
5549                                                              fence-paired-atomic.
5550
5551      **Acquire-Release Atomic**
5552      ------------------------------------------------------------------------------------
5553      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
5554                                - wavefront    - local
5555                                               - generic
5556      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5557
5558                                                            - If OpenCL, omit.
5559                                                            - Must happen after
5560                                                              any preceding
5561                                                              local/generic
5562                                                              load/store/load
5563                                                              atomic/store
5564                                                              atomic/atomicrmw.
5565                                                            - Must happen before
5566                                                              the following
5567                                                              atomicrmw.
5568                                                            - Ensures that all
5569                                                              memory operations
5570                                                              to local have
5571                                                              completed before
5572                                                              performing the
5573                                                              atomicrmw that is
5574                                                              being released.
5575
5576                                                          2. buffer/global_atomic
5577
5578      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
5579                                                          2. s_waitcnt lgkmcnt(0)
5580
5581                                                            - If OpenCL, omit.
5582                                                            - Must happen before
5583                                                              any following
5584                                                              global/generic
5585                                                              load/load
5586                                                              atomic/store/store
5587                                                              atomic/atomicrmw.
5588                                                            - Ensures any
5589                                                              following global
5590                                                              data read is no
5591                                                              older than the local load
5592                                                              atomic value being
5593                                                              acquired.
5594
5595      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
5596
5597                                                            - If OpenCL, omit.
5598                                                            - Must happen after
5599                                                              any preceding
5600                                                              local/generic
5601                                                              load/store/load
5602                                                              atomic/store
5603                                                              atomic/atomicrmw.
5604                                                            - Must happen before
5605                                                              the following
5606                                                              atomicrmw.
5607                                                            - Ensures that all
5608                                                              memory operations
5609                                                              to local have
5610                                                              completed before
5611                                                              performing the
5612                                                              atomicrmw that is
5613                                                              being released.
5614
5615                                                          2. flat_atomic
5616                                                          3. s_waitcnt lgkmcnt(0)
5617
5618                                                            - If OpenCL, omit.
5619                                                            - Must happen before
5620                                                              any following
5621                                                              global/generic
5622                                                              load/load
5623                                                              atomic/store/store
5624                                                              atomic/atomicrmw.
5625                                                            - Ensures any
5626                                                              following global
5627                                                              data read is no
5628                                                              older than a local load
5629                                                              atomic value being
5630                                                              acquired.
5631
5632      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
5633                                - system                     vmcnt(0)
5634
5635                                                            - If OpenCL, omit
5636                                                              lgkmcnt(0).
5637                                                            - Could be split into
5638                                                              separate s_waitcnt
5639                                                              vmcnt(0) and
5640                                                              s_waitcnt
5641                                                              lgkmcnt(0) to allow
5642                                                              them to be
5643                                                              independently moved
5644                                                              according to the
5645                                                              following rules.
5646                                                            - s_waitcnt vmcnt(0)
5647                                                              must happen after
5648                                                              any preceding
5649                                                              global/generic
5650                                                              load/store/load
5651                                                              atomic/store
5652                                                              atomic/atomicrmw.
5653                                                            - s_waitcnt lgkmcnt(0)
5654                                                              must happen after
5655                                                              any preceding
5656                                                              local/generic
5657                                                              load/store/load
5658                                                              atomic/store
5659                                                              atomic/atomicrmw.
5660                                                            - Must happen before
5661                                                              the following
5662                                                              atomicrmw.
5663                                                            - Ensures that all
5664                                                              memory operations
5665                                                              to global have
5666                                                              completed before
5667                                                              performing the
5668                                                              atomicrmw that is
5669                                                              being released.
5670
5671                                                          2. buffer/global_atomic
5672                                                          3. s_waitcnt vmcnt(0)
5673
5674                                                            - Must happen before
5675                                                              following
5676                                                              buffer_wbinvl1_vol.
5677                                                            - Ensures the
5678                                                              atomicrmw has
5679                                                              completed before
5680                                                              invalidating the
5681                                                              cache.
5682
5683                                                          4. buffer_wbinvl1_vol
5684
5685                                                            - Must happen before
5686                                                              any following
5687                                                              global/generic
5688                                                              load/load
5689                                                              atomic/atomicrmw.
5690                                                            - Ensures that
5691                                                              following loads
5692                                                              will not see stale
5693                                                              global data.
5694
5695      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
5696                                - system                     vmcnt(0)
5697
5698                                                            - If OpenCL, omit
5699                                                              lgkmcnt(0).
5700                                                            - Could be split into
5701                                                              separate s_waitcnt
5702                                                              vmcnt(0) and
5703                                                              s_waitcnt
5704                                                              lgkmcnt(0) to allow
5705                                                              them to be
5706                                                              independently moved
5707                                                              according to the
5708                                                              following rules.
5709                                                            - s_waitcnt vmcnt(0)
5710                                                              must happen after
5711                                                              any preceding
5712                                                              global/generic
5713                                                              load/store/load
5714                                                              atomic/store
5715                                                              atomic/atomicrmw.
5716                                                            - s_waitcnt lgkmcnt(0)
5717                                                              must happen after
5718                                                              any preceding
5719                                                              local/generic
5720                                                              load/store/load
5721                                                              atomic/store
5722                                                              atomic/atomicrmw.
5723                                                            - Must happen before
5724                                                              the following
5725                                                              atomicrmw.
5726                                                            - Ensures that all
5727                                                              memory operations
5728                                                              to global have
5729                                                              completed before
5730                                                              performing the
5731                                                              atomicrmw that is
5732                                                              being released.
5733
5734                                                          2. flat_atomic
5735                                                          3. s_waitcnt vmcnt(0) &
5736                                                             lgkmcnt(0)
5737
5738                                                            - If OpenCL, omit
5739                                                              lgkmcnt(0).
5740                                                            - Must happen before
5741                                                              following
5742                                                              buffer_wbinvl1_vol.
5743                                                            - Ensures the
5744                                                              atomicrmw has
5745                                                              completed before
5746                                                              invalidating the
5747                                                              cache.
5748
5749                                                          4. buffer_wbinvl1_vol
5750
5751                                                            - Must happen before
5752                                                              any following
5753                                                              global/generic
5754                                                              load/load
5755                                                              atomic/atomicrmw.
5756                                                            - Ensures that
5757                                                              following loads
5758                                                              will not see stale
5759                                                              global data.
5760
5761      fence        acq_rel      - singlethread *none*     *none*
5762                                - wavefront
5763      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
5764
5765                                                            - If OpenCL and
5766                                                              address space is
5767                                                              not generic, omit.
5768                                                            - However,
5769                                                              since LLVM
5770                                                              currently has no
5771                                                              address space on
5772                                                              the fence, it must
5773                                                              conservatively
5774                                                              always be generated
5775                                                              (see comment for
5776                                                              previous fence).
5777                                                            - Must happen after
5778                                                              any preceding
5779                                                              local/generic
5780                                                              load/load
5781                                                              atomic/store/store
5782                                                              atomic/atomicrmw.
5783                                                            - Must happen before
5784                                                              any following
5785                                                              global/generic
5786                                                              load/load
5787                                                              atomic/store/store
5788                                                              atomic/atomicrmw.
5789                                                            - Ensures that all
5790                                                              memory operations
5791                                                              to local have
5792                                                              completed before
5793                                                              performing any
5794                                                              following global
5795                                                              memory operations.
5796                                                            - Ensures that the
5797                                                              preceding
5798                                                              local/generic load
5799                                                              atomic/atomicrmw
5800                                                              with an equal or
5801                                                              wider sync scope
5802                                                              and memory ordering
5803                                                              stronger than
5804                                                              unordered (this is
5805                                                              termed the
5806                                                              acquire-fence-paired-atomic)
5807                                                              has completed
5808                                                              before following
5809                                                              global memory
5810                                                              operations. This
5811                                                              satisfies the
5812                                                              requirements of
5813                                                              acquire.
5814                                                            - Ensures that all
5815                                                              previous memory
5816                                                              operations have
5817                                                              completed before a
5818                                                              following
5819                                                              local/generic store
5820                                                              atomic/atomicrmw
5821                                                              with an equal or
5822                                                              wider sync scope
5823                                                              and memory ordering
5824                                                              stronger than
5825                                                              unordered (this is
5826                                                              termed the
5827                                                              release-fence-paired-atomic).
5828                                                              This satisfies the
5829                                                              requirements of
5830                                                              release.
5831
5832      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
5833                                - system                     vmcnt(0)
5834
5835                                                            - If OpenCL and
5836                                                              address space is
5837                                                              not generic, omit
5838                                                              lgkmcnt(0).
5839                                                            - However, since LLVM
5840                                                              currently has no
5841                                                              address space on
5842                                                              the fence, it must
5843                                                              conservatively
5844                                                              always be generated
5845                                                              (see comment for
5846                                                              previous fence).
5847                                                            - Could be split into
5848                                                              separate s_waitcnt
5849                                                              vmcnt(0) and
5850                                                              s_waitcnt
5851                                                              lgkmcnt(0) to allow
5852                                                              them to be
5853                                                              independently moved
5854                                                              according to the
5855                                                              following rules.
5856                                                            - s_waitcnt vmcnt(0)
5857                                                              must happen after
5858                                                              any preceding
5859                                                              global/generic
5860                                                              load/store/load
5861                                                              atomic/store
5862                                                              atomic/atomicrmw.
5863                                                            - s_waitcnt lgkmcnt(0)
5864                                                              must happen after
5865                                                              any preceding
5866                                                              local/generic
5867                                                              load/store/load
5868                                                              atomic/store
5869                                                              atomic/atomicrmw.
5870                                                            - Must happen before
5871                                                              the following
5872                                                              buffer_wbinvl1_vol.
5873                                                            - Ensures that the
5874                                                              preceding
5875                                                              global/local/generic
5876                                                              load
5877                                                              atomic/atomicrmw
5878                                                              with an equal or
5879                                                              wider sync scope
5880                                                              and memory ordering
5881                                                              stronger than
5882                                                              unordered (this is
5883                                                              termed the
5884                                                              acquire-fence-paired-atomic)
5885                                                              has completed
5886                                                              before invalidating
5887                                                              the cache. This
5888                                                              satisfies the
5889                                                              requirements of
5890                                                              acquire.
5891                                                            - Ensures that all
5892                                                              previous memory
5893                                                              operations have
5894                                                              completed before a
5895                                                              following
5896                                                              global/local/generic
5897                                                              store
5898                                                              atomic/atomicrmw
5899                                                              with an equal or
5900                                                              wider sync scope
5901                                                              and memory ordering
5902                                                              stronger than
5903                                                              unordered (this is
5904                                                              termed the
5905                                                              release-fence-paired-atomic).
5906                                                              This satisfies the
5907                                                              requirements of
5908                                                              release.
5909
5910                                                          2. buffer_wbinvl1_vol
5911
5912                                                            - Must happen before
5913                                                              any following
5914                                                              global/generic
5915                                                              load/load
5916                                                              atomic/store/store
5917                                                              atomic/atomicrmw.
5918                                                            - Ensures that
5919                                                              following loads
5920                                                              will not see stale
5921                                                              global data. This
5922                                                              satisfies the
5923                                                              requirements of
5924                                                              acquire.
5925
5926      **Sequentially Consistent Atomic**
5927      ------------------------------------------------------------------------------------
5928      load atomic  seq_cst      - singlethread - global   *Same as corresponding
5929                                - wavefront    - local    load atomic acquire,
5930                                               - generic  except must generate
5931                                                          all instructions even
5932                                                          for OpenCL.*
5933      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
5934                                               - generic
5935
5936                                                            - Must
5937                                                              happen after
5938                                                              preceding
5939                                                              local/generic load
5940                                                              atomic/store
5941                                                              atomic/atomicrmw
5942                                                              with memory
5943                                                              ordering of seq_cst
5944                                                              and with equal or
5945                                                              wider sync scope.
5946                                                              (Note that seq_cst
5947                                                              fences have their
5948                                                              own s_waitcnt
5949                                                              lgkmcnt(0) and so do
5950                                                              not need to be
5951                                                              considered.)
5952                                                            - Ensures any
5953                                                              preceding
5954                                                              sequentially
5955                                                              consistent local
5956                                                              memory instructions
5957                                                              have completed
5958                                                              before executing
5959                                                              this sequentially
5960                                                              consistent
5961                                                              instruction. This
5962                                                              prevents reordering
5963                                                              a seq_cst store
5964                                                              followed by a
5965                                                              seq_cst load. (Note
5966                                                              that seq_cst is
5967                                                              stronger than
5968                                                              acquire/release as
5969                                                              the reordering of
5970                                                              load acquire
5971                                                              followed by a store
5972                                                              release is
5973                                                              prevented by the
5974                                                              s_waitcnt of
5975                                                              the release, but
5976                                                              there is nothing
5977                                                              preventing a store
5978                                                              release followed by
5979                                                              load acquire from
5980                                                              completing out of
5981                                                              order. The s_waitcnt
5982                                                              could be placed after
5983                                                              the seq_cst store or
5984                                                              before the seq_cst load. We
5985                                                              choose the load to
5986                                                              make the s_waitcnt be
5987                                                              as late as possible
5988                                                              so that the store
5989                                                              may have already
5990                                                              completed.)
5991
5992                                                          2. *Following
5993                                                             instructions same as
5994                                                             corresponding load
5995                                                             atomic acquire,
5996                                                             except must generate
5997                                                             all instructions even
5998                                                             for OpenCL.*
5999      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
6000                                                          load atomic acquire,
6001                                                          except must generate
6002                                                          all instructions even
6003                                                          for OpenCL.*
6004
6005      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
6006                                - system       - generic     vmcnt(0)
6007
6008                                                            - Could be split into
6009                                                              separate s_waitcnt
6010                                                              vmcnt(0)
6011                                                              and s_waitcnt
6012                                                              lgkmcnt(0) to allow
6013                                                              them to be
6014                                                              independently moved
6015                                                              according to the
6016                                                              following rules.
6017                                                            - s_waitcnt lgkmcnt(0)
6018                                                              must happen after
6019                                                              preceding
6020                                                              local/generic load
6021                                                              atomic/store
6022                                                              atomic/atomicrmw
6023                                                              with memory
6024                                                              ordering of seq_cst
6025                                                              and with equal or
6026                                                              wider sync scope.
6027                                                              (Note that seq_cst
6028                                                              fences have their
6029                                                              own s_waitcnt
6030                                                              lgkmcnt(0) and so do
6031                                                              not need to be
6032                                                              considered.)
6033                                                            - s_waitcnt vmcnt(0)
6034                                                              must happen after
6035                                                              preceding
6036                                                              global/generic load
6037                                                              atomic/store
6038                                                              atomic/atomicrmw
6039                                                              with memory
6040                                                              ordering of seq_cst
6041                                                              and with equal or
6042                                                              wider sync scope.
6043                                                              (Note that seq_cst
6044                                                              fences have their
6045                                                              own s_waitcnt
6046                                                              vmcnt(0) and so do
6047                                                              not need to be
6048                                                              considered.)
6049                                                            - Ensures any
6050                                                              preceding
6051                                                              sequentially
6052                                                              consistent global
6053                                                              memory instructions
6054                                                              have completed
6055                                                              before executing
6056                                                              this sequentially
6057                                                              consistent
6058                                                              instruction. This
6059                                                              prevents reordering
6060                                                              a seq_cst store
6061                                                              followed by a
6062                                                              seq_cst load. (Note
6063                                                              that seq_cst is
6064                                                              stronger than
6065                                                              acquire/release as
6066                                                              the reordering of
6067                                                              load acquire
6068                                                              followed by a store
6069                                                              release is
6070                                                              prevented by the
6071                                                              s_waitcnt of
6072                                                              the release, but
6073                                                              there is nothing
6074                                                              preventing a store
6075                                                              release followed by
6076                                                              load acquire from
6077                                                              completing out of
6078                                                              order. The s_waitcnt
6079                                                              could be placed after the
6080                                                              seq_cst store or before
6081                                                              the seq_cst load. We
6082                                                              choose the load to
6083                                                              make the s_waitcnt be
6084                                                              as late as possible
6085                                                              so that the store
6086                                                              may have already
6087                                                              completed.)
6088
6089                                                          2. *Following
6090                                                             instructions same as
6091                                                             corresponding load
6092                                                             atomic acquire,
6093                                                             except must generate
6094                                                             all instructions even
6095                                                             for OpenCL.*
6096      store atomic seq_cst      - singlethread - global   *Same as corresponding
6097                                - wavefront    - local    store atomic release,
6098                                - workgroup    - generic  except must generate
6099                                - agent                   all instructions even
6100                                - system                  for OpenCL.*
6101      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
6102                                - wavefront    - local    atomicrmw acq_rel,
6103                                - workgroup    - generic  except must generate
6104                                - agent                   all instructions even
6105                                - system                  for OpenCL.*
6106      fence        seq_cst      - singlethread *none*     *Same as corresponding
6107                                - wavefront               fence acq_rel,
6108                                - workgroup               except must generate
6109                                - agent                   all instructions even
6110                                - system                  for OpenCL.*
6111      ============ ============ ============== ========== ================================
6112
6113 .. _amdgpu-amdhsa-memory-model-gfx90a:
6114
6115 Memory Model GFX90A
6116 +++++++++++++++++++
6117
6118 For GFX90A:
6119
6120 * Each agent has multiple shader arrays (SA).
6121 * Each SA has multiple compute units (CU).
6122 * Each CU has multiple SIMDs that execute wavefronts.
6123 * The wavefronts for a single work-group are executed in the same CU but may be
6124   executed by different SIMDs. The exception is tgsplit execution mode, in
6125   which the wavefronts may be executed by different SIMDs in different CUs.
6126 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6127   executing on it. The exception is tgsplit execution mode, in which no LDS
6128   is allocated, since wavefronts of the same work-group can be in different CUs.
6129 * All LDS operations of a CU are performed as wavefront wide operations in a
6130   global order and involve no caching. Completion is reported to a wavefront in
6131   execution order.
6132 * The LDS memory has multiple request queues shared by the SIMDs of a
6133   CU. Therefore, the LDS operations performed by different wavefronts of a
6134   work-group can be reordered relative to each other, which can result in
6135   reordering the visibility of vector memory operations with respect to LDS
6136   operations of other wavefronts in the same work-group. A ``s_waitcnt
6137   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6138   vector memory operations between wavefronts of a work-group, but not between
6139   operations performed by the same wavefront.
6140 * The vector memory operations are performed as wavefront wide operations and
6141   completion is reported to a wavefront in execution order. The exception is
6142   that ``flat_load/store/atomic`` instructions can report out of vector memory
6143   order if they access LDS memory, and out of LDS operation order if they access
6144   global memory.
6145 * The vector memory operations access a single vector L1 cache shared by all
6146   SIMDs of a CU. Therefore:
6147
6148   * No special action is required for coherence between the lanes of a single
6149     wavefront.
6150
6151   * No special action is required for coherence between wavefronts in the same
6152     work-group since they execute on the same CU. The exception is tgsplit
6153     execution mode, in which wavefronts of the same work-group can be in
6154     different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6155     the following item.
6156
6157   * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6158     executing in different work-groups as they may be executing on different
6159     CUs.
6160
6161 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6162   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6163   scalar operations are used in a restricted way so do not impact the memory
6164   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6165 * The vector and scalar memory operations use an L2 cache shared by all CUs on
6166   the same agent.
6167
6168   * The L2 cache has independent channels to service disjoint ranges of virtual
6169     addresses.
6170   * Each CU has a separate request queue per channel. Therefore, the vector and
6171     scalar memory operations performed by wavefronts executing in different
6172     work-groups (which may be executing on different CUs), or the same
6173     work-group if executing in tgsplit mode, of an agent can be reordered
6174     relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6175     synchronization between vector memory operations of different CUs. It
6176     ensures a previous vector memory operation has completed before executing a
6177     subsequent vector memory or LDS operation and so can be used to meet the
6178     requirements of acquire and release.
6179   * The L2 cache of one agent can be kept coherent with other agents by:
6180     using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6181     C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6182     the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6183
6184     * Any local memory cache lines will be automatically invalidated by writes
6185       from CUs associated with other L2 caches, or writes from the CPU, due to
6186       the cache probe caused by coherent requests. Coherent requests are caused
6187       by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6188       XGMI, and by PCIe requests that are configured to be coherent requests.
6189     * XGMI accesses from the CPU to local memory may be cached on the CPU.
6190       Subsequent access from the GPU will automatically invalidate or writeback
6191       the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6192     * Since all work-groups on the same agent share the same L2, no L2
6193       invalidation or writeback is required for coherence.
6194     * To ensure coherence of local and remote memory writes of work-groups in
6195       different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6196       cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6197       (used for remote coarse grain memory). Note that MTYPE CC (used for local
6198       fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6199       remote fine grain memory) bypasses the L2, so both will never result in
6200       dirty L2 cache lines.
6201     * To ensure coherence of local and remote memory reads of work-groups in
6202       different agents a ``buffer_invl2`` is required. It will invalidate L2
6203       cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6204       MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6205       coarse grain memory) cause local reads to be invalidated by remote writes
6206       with the PTE C-bit so these cache lines are not invalidated. Note that
6207       MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6208       never result in L2 cache lines that need to be invalidated.
6209
6210   * PCIe access from the GPU to the CPU memory is kept coherent by using the
6211     MTYPE UC (uncached) which bypasses the L2.
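
The MTYPE rules in the list above can be summarized as a small lookup table.
The following Python sketch is illustrative only: ``MTYPE_L2_BEHAVIOR``,
``written_back_by_wbl2`` and ``invalidated_by_invl2`` are hypothetical names
that are not part of LLVM or any driver API. The table merely restates which
MTYPEs can leave dirty L2 cache lines (written back by ``buffer_wbl2``) and
which are invalidated by ``buffer_invl2``, as described above.

```python
# Hypothetical summary of the GFX90A L2 behavior described in the text.
# Names are illustrative only; they are not LLVM or driver APIs.
# mtype -> (typical use, may hold dirty L2 lines, invalidated by buffer_invl2)
MTYPE_L2_BEHAVIOR = {
    "RW": ("local coarse grain memory", True, False),
    "NC": ("remote coarse grain memory", True, True),
    # CC writes through to DRAM and is kept coherent by cache probes.
    "CC": ("local fine grain memory", False, False),
    # UC bypasses the L2 entirely.
    "UC": ("remote fine grain memory", False, False),
}


def written_back_by_wbl2(mtype):
    """True if buffer_wbl2 may have dirty L2 lines of this MTYPE to write back."""
    return MTYPE_L2_BEHAVIOR[mtype][1]


def invalidated_by_invl2(mtype):
    """True if buffer_invl2 invalidates L2 lines of this MTYPE."""
    return MTYPE_L2_BEHAVIOR[mtype][2]
```

Under these assumptions, only MTYPE RW and NC lines are affected by
``buffer_wbl2``, and only MTYPE NC lines are affected by ``buffer_invl2``.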
6212
6213 Scalar memory operations are only used to access memory that is proven to not
6214 change during the execution of the kernel dispatch. This includes constant
6215 address space and global address space for program scope ``const`` variables.
6216 Therefore, the kernel machine code does not have to maintain the scalar cache to
6217 ensure it is coherent with the vector caches. The scalar and vector caches are
6218 invalidated between kernel dispatches by CP since constant address space data
6219 may change between kernel dispatch executions. See
6220 :ref:`amdgpu-amdhsa-memory-spaces`.
6221
6222 The one exception is if scalar writes are used to spill SGPR registers. In this
6223 case the AMDGPU backend ensures the memory location used to spill is never
6224 accessed by vector memory operations at the same time. If scalar writes are used
6225 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6226 return since the locations may be used for vector memory instructions by a
6227 future wavefront that uses the same scratch area, or a function call that
6228 creates a frame at the same address, respectively. There is no need for a
6229 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6230
6231 For kernarg backing memory:
6232
6233 * CP invalidates the L1 cache at the start of each kernel dispatch.
6234 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6235   memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6236   cache. This also causes it to be treated as non-volatile and so is not
6237   invalidated by ``*_vol``.
6238 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6239   so the L2 cache will be coherent with the CPU and other agents.
6240
6241 Scratch backing memory (which is used for the private address space) is accessed
6242 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6243 only accessed by a single thread, and is always write-before-read, there is
6244 never a need to invalidate these entries from the L1 cache. Hence all cache
6245 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6246
6247 The code sequences used to implement the memory model for GFX90A are defined
6248 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
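
As an illustration of how the rows of this table depend on sync scope and
execution mode, the following Python sketch reproduces the ``load atomic
acquire`` rows for the global address space. The function name
``load_atomic_acquire_global`` and the list-of-strings representation are
hypothetical conveniences, not part of LLVM; the instruction strings are copied
from the table rows.

```python
def load_atomic_acquire_global(scope, tgsplit=False):
    """Sketch of the GFX90A sequence for a global 'load atomic acquire'.

    Hypothetical helper: it only mirrors the table rows for the global
    address space and does not model any other instruction or address space.
    """
    if scope in ("singlethread", "wavefront"):
        # The table lists only the load itself for these scopes.
        return ["buffer/global/ds/flat_load"]
    if scope == "workgroup":
        if not tgsplit:
            # Outside tgsplit mode, glc=1, the wait and the invalidate
            # are all omitted.
            return ["buffer/global_load"]
        return [
            "buffer/global_load glc=1",
            "s_waitcnt vmcnt(0)",    # must precede the invalidate
            "buffer_wbinvl1_vol",    # following loads will not see stale data
        ]
    if scope == "agent":
        return [
            "buffer/global_load glc=1",
            "s_waitcnt vmcnt(0)",    # load completes before invalidating
            "buffer_wbinvl1_vol",
        ]
    if scope == "system":
        return [
            "buffer/global/flat_load glc=1",
            "s_waitcnt vmcnt(0)",
            "buffer_invl2",          # drops stale L2 MTYPE NC lines
            "buffer_wbinvl1_vol",    # drops stale L1 lines
        ]
    raise ValueError(f"unknown sync scope: {scope!r}")
```

For example, ``load_atomic_acquire_global("workgroup")`` yields only the load,
while passing ``tgsplit=True`` adds the wait and the L1 invalidate, because
wavefronts of the same work-group may then execute on different CUs.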
6249
6250   .. table:: AMDHSA Memory Model Code Sequences GFX90A
6251      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6252
6253      ============ ============ ============== ========== ================================
6254      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
6255                   Ordering     Sync Scope     Address    GFX90A
6256                                               Space
6257      ============ ============ ============== ========== ================================
6258      **Non-Atomic**
6259      ------------------------------------------------------------------------------------
6260      load         *none*       *none*         - global   - !volatile & !nontemporal
6261                                               - generic
6262                                               - private    1. buffer/global/flat_load
6263                                               - constant
6264                                                          - !volatile & nontemporal
6265
6266                                                            1. buffer/global/flat_load
6267                                                               glc=1 slc=1
6268
6269                                                          - volatile
6270
6271                                                            1. buffer/global/flat_load
6272                                                               glc=1
6273                                                            2. s_waitcnt vmcnt(0)
6274
6275                                                             - Must happen before
6276                                                               any following volatile
6277                                                               global/generic
6278                                                               load/store.
6279                                                             - Ensures that
6280                                                               volatile
6281                                                               operations to
6282                                                               different
6283                                                               addresses will not
6284                                                               be reordered by
6285                                                               hardware.
6286
6287      load         *none*       *none*         - local    1. ds_load
6288      store        *none*       *none*         - global   - !volatile & !nontemporal
6289                                               - generic
6290                                               - private    1. buffer/global/flat_store
6291                                               - constant
6292                                                          - !volatile & nontemporal
6293
6294                                                            1. buffer/global/flat_store
6295                                                               glc=1 slc=1
6296
6297                                                          - volatile
6298
6299                                                            1. buffer/global/flat_store
6300                                                            2. s_waitcnt vmcnt(0)
6301
6302                                                             - Must happen before
6303                                                               any following volatile
6304                                                               global/generic
6305                                                               load/store.
6306                                                             - Ensures that
6307                                                               volatile
6308                                                               operations to
6309                                                               different
6310                                                               addresses will not
6311                                                               be reordered by
6312                                                               hardware.
6313
6314      store        *none*       *none*         - local    1. ds_store
6315      **Unordered Atomic**
6316      ------------------------------------------------------------------------------------
6317      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
6318      store atomic unordered    *any*          *any*      *Same as non-atomic*.
6319      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
6320      **Monotonic Atomic**
6321      ------------------------------------------------------------------------------------
6322      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
6323                                - wavefront    - generic
6324      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
6325                                               - generic     glc=1
6326
6327                                                            - If not TgSplit execution
6328                                                              mode, omit glc=1.
6329
6330      load atomic  monotonic    - singlethread - local    *If TgSplit execution mode,
6331                                - wavefront               local address space cannot
6332                                - workgroup               be used.*
6333
6334                                                          1. ds_load
6335      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
6336                                               - generic     glc=1
6337      load atomic  monotonic    - system       - global   1. buffer/global/flat_load
6338                                               - generic     glc=1
6339      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
6340                                - wavefront    - generic
6341                                - workgroup
6342                                - agent
6343      store atomic monotonic    - system       - global   1. buffer/global/flat_store
6344                                               - generic
6345      store atomic monotonic    - singlethread - local    *If TgSplit execution mode,
6346                                - wavefront               local address space cannot
6347                                - workgroup               be used.*
6348
6349                                                          1. ds_store
6350      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
6351                                - wavefront    - generic
6352                                - workgroup
6353                                - agent
6354      atomicrmw    monotonic    - system       - global   1. buffer/global/flat_atomic
6355                                               - generic
6356      atomicrmw    monotonic    - singlethread - local    *If TgSplit execution mode,
6357                                - wavefront               local address space cannot
6358                                - workgroup               be used.*
6359
6360                                                          1. ds_atomic
6361      **Acquire Atomic**
6362      ------------------------------------------------------------------------------------
6363      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
6364                                - wavefront    - local
6365                                               - generic
6366      load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
6367
6368                                                            - If not TgSplit execution
6369                                                              mode, omit glc=1.
6370
6371                                                          2. s_waitcnt vmcnt(0)
6372
6373                                                            - If not TgSplit execution
6374                                                              mode, omit.
6375                                                            - Must happen before the
6376                                                              following buffer_wbinvl1_vol.
6377
6378                                                          3. buffer_wbinvl1_vol
6379
6380                                                            - If not TgSplit execution
6381                                                              mode, omit.
6382                                                            - Must happen before
6383                                                              any following
6384                                                              global/generic
6385                                                              load/load
6386                                                              atomic/store/store
6387                                                              atomic/atomicrmw.
6388                                                            - Ensures that
6389                                                              following
6390                                                              loads will not see
6391                                                              stale data.
6392
6393      load atomic  acquire      - workgroup    - local    *If TgSplit execution mode,
6394                                                          local address space cannot
6395                                                          be used.*
6396
6397                                                          1. ds_load
6398                                                          2. s_waitcnt lgkmcnt(0)
6399
6400                                                            - If OpenCL, omit.
6401                                                            - Must happen before
6402                                                              any following
6403                                                              global/generic
6404                                                              load/load
6405                                                              atomic/store/store
6406                                                              atomic/atomicrmw.
6407                                                            - Ensures any
6408                                                              following global
6409                                                              data read is no
6410                                                              older than the local load
6411                                                              atomic value being
6412                                                              acquired.
6413
6414      load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
6415
6416                                                            - If not TgSplit execution
6417                                                              mode, omit glc=1.
6418
6419                                                          2. s_waitcnt lgkm/vmcnt(0)
6420
6421                                                            - Use lgkmcnt(0) if not
6422                                                              TgSplit execution mode
6423                                                              and vmcnt(0) if TgSplit
6424                                                              execution mode.
6425                                                            - If OpenCL, omit lgkmcnt(0).
6426                                                            - Must happen before
6427                                                              the following
6428                                                              buffer_wbinvl1_vol and any
6429                                                              following global/generic
6430                                                              load/load
6431                                                              atomic/store/store
6432                                                              atomic/atomicrmw.
6433                                                            - Ensures any
6434                                                              following global
6435                                                              data read is no
6436                                                              older than a local load
6437                                                              atomic value being
6438                                                              acquired.
6439
6440                                                          3. buffer_wbinvl1_vol
6441
6442                                                            - If not TgSplit execution
6443                                                              mode, omit.
6444                                                            - Ensures that
6445                                                              following
6446                                                              loads will not see
6447                                                              stale data.
6448
6449      load atomic  acquire      - agent        - global   1. buffer/global_load
6450                                                             glc=1
6451                                                          2. s_waitcnt vmcnt(0)
6452
6453                                                            - Must happen before
6454                                                              following
6455                                                              buffer_wbinvl1_vol.
6456                                                            - Ensures the load
6457                                                              has completed
6458                                                              before invalidating
6459                                                              the cache.
6460
6461                                                          3. buffer_wbinvl1_vol
6462
6463                                                            - Must happen before
6464                                                              any following
6465                                                              global/generic
6466                                                              load/load
6467                                                              atomic/atomicrmw.
6468                                                            - Ensures that
6469                                                              following
6470                                                              loads will not see
6471                                                              stale global data.
6472
6473      load atomic  acquire      - system       - global   1. buffer/global/flat_load
6474                                                             glc=1
6475                                                          2. s_waitcnt vmcnt(0)
6476
6477                                                            - Must happen before
6478                                                              following buffer_invl2 and
6479                                                              buffer_wbinvl1_vol.
6480                                                            - Ensures the load
6481                                                              has completed
6482                                                              before invalidating
6483                                                              the cache.
6484
6485                                                          3. buffer_invl2;
6486                                                             buffer_wbinvl1_vol
6487
6488                                                            - Must happen before
6489                                                              any following
6490                                                              global/generic
6491                                                              load/load
6492                                                              atomic/atomicrmw.
6493                                                            - Ensures that
6494                                                              following
6495                                                              loads will not see
6496                                                              stale L1 global data,
6497                                                              nor see stale L2 MTYPE
6498                                                              NC global data.
6499                                                              MTYPE RW and CC memory will
6500                                                              never be stale in L2 due to
6501                                                              the memory probes.
6502
6503      load atomic  acquire      - agent        - generic  1. flat_load glc=1
6504                                                          2. s_waitcnt vmcnt(0) &
6505                                                             lgkmcnt(0)
6506
6507                                                            - If TgSplit execution mode,
6508                                                              omit lgkmcnt(0).
6509                                                            - If OpenCL, omit
6510                                                              lgkmcnt(0).
6511                                                            - Must happen before
6512                                                              following
6513                                                              buffer_wbinvl1_vol.
6514                                                            - Ensures the flat_load
6515                                                              has completed
6516                                                              before invalidating
6517                                                              the cache.
6518
6519                                                          3. buffer_wbinvl1_vol
6520
6521                                                            - Must happen before
6522                                                              any following
6523                                                              global/generic
6524                                                              load/load
6525                                                              atomic/atomicrmw.
6526                                                            - Ensures that
6527                                                              following loads
6528                                                              will not see stale
6529                                                              global data.
6530
6531      load atomic  acquire      - system       - generic  1. flat_load glc=1
6532                                                          2. s_waitcnt vmcnt(0) &
6533                                                             lgkmcnt(0)
6534
6535                                                            - If TgSplit execution mode,
6536                                                              omit lgkmcnt(0).
6537                                                            - If OpenCL, omit
6538                                                              lgkmcnt(0).
6539                                                            - Must happen before
6540                                                              the following
6541                                                              buffer_invl2 and
6542                                                              buffer_wbinvl1_vol.
6543                                                            - Ensures the flat_load
6544                                                              has completed
6545                                                              before invalidating
6546                                                              the caches.
6547
6548                                                          3. buffer_invl2;
6549                                                             buffer_wbinvl1_vol
6550
6551                                                            - Must happen before
6552                                                              any following
6553                                                              global/generic
6554                                                              load/load
6555                                                              atomic/atomicrmw.
6556                                                            - Ensures that
6557                                                              following
6558                                                              loads will not see
6559                                                              stale L1 global data,
6560                                                              nor see stale L2 MTYPE
6561                                                              NC global data.
6562                                                              MTYPE RW and CC memory will
6563                                                              never be stale in L2 due to
6564                                                              the memory probes.
6565
6566      atomicrmw    acquire      - singlethread - global   1. buffer/global/flat_atomic
6567                                - wavefront    - generic
6568      atomicrmw    acquire      - singlethread - local    *If TgSplit execution mode,
6569                                - wavefront               local address space cannot
6570                                                          be used.*
6571
6572                                                          1. ds_atomic
6573      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
6574                                                          2. s_waitcnt vmcnt(0)
6575
6576                                                            - If not TgSplit execution
6577                                                              mode, omit.
6578                                                            - Must happen before the
6579                                                              following buffer_wbinvl1_vol.
6580                                                            - Ensures the atomicrmw
6581                                                              has completed
6582                                                              before invalidating
6583                                                              the cache.
6584
6585                                                          3. buffer_wbinvl1_vol
6586
6587                                                            - If not TgSplit execution
6588                                                              mode, omit.
6589                                                            - Must happen before
6590                                                              any following
6591                                                              global/generic
6592                                                              load/load
6593                                                              atomic/atomicrmw.
6594                                                            - Ensures that
6595                                                              following loads
6596                                                              will not see stale
6597                                                              global data.
6598
6599      atomicrmw    acquire      - workgroup    - local    *If TgSplit execution mode,
6600                                                          local address space cannot
6601                                                          be used.*
6602
6603                                                          1. ds_atomic
6604                                                          2. s_waitcnt lgkmcnt(0)
6605
6606                                                            - If OpenCL, omit.
6607                                                            - Must happen before
6608                                                              any following
6609                                                              global/generic
6610                                                              load/load
6611                                                              atomic/store/store
6612                                                              atomic/atomicrmw.
6613                                                            - Ensures any
6614                                                              following global
6615                                                              data read is no
6616                                                              older than the local
6617                                                              atomicrmw value
6618                                                              being acquired.
6619
6620      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
6621                                                          2. s_waitcnt lgkm/vmcnt(0)
6622
6623                                                            - Use lgkmcnt(0) if not
6624                                                              TgSplit execution mode
6625                                                              and vmcnt(0) if TgSplit
6626                                                              execution mode.
6627                                                            - If OpenCL, omit lgkmcnt(0).
6628                                                            - Must happen before
6629                                                              the following
6630                                                              buffer_wbinvl1_vol and
6631                                                              any following
6632                                                              global/generic
6633                                                              load/load
6634                                                              atomic/store/store
6635                                                              atomic/atomicrmw.
6636                                                            - Ensures any
6637                                                              following global
6638                                                              data read is no
6639                                                              older than a local
6640                                                              atomicrmw value
6641                                                              being acquired.
6642
6643                                                          3. buffer_wbinvl1_vol
6644
6645                                                            - If not TgSplit execution
6646                                                              mode, omit.
6647                                                            - Ensures that
6648                                                              following
6649                                                              loads will not see
6650                                                              stale data.
6651
6652      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
6653                                                          2. s_waitcnt vmcnt(0)
6654
6655                                                            - Must happen before
6656                                                              the following
6657                                                              buffer_wbinvl1_vol.
6658                                                            - Ensures the
6659                                                              atomicrmw has
6660                                                              completed before
6661                                                              invalidating the
6662                                                              cache.
6663
6664                                                          3. buffer_wbinvl1_vol
6665
6666                                                            - Must happen before
6667                                                              any following
6668                                                              global/generic
6669                                                              load/load
6670                                                              atomic/atomicrmw.
6671                                                            - Ensures that
6672                                                              following loads
6673                                                              will not see stale
6674                                                              global data.
6675
6676      atomicrmw    acquire      - system       - global   1. buffer/global_atomic
6677                                                          2. s_waitcnt vmcnt(0)
6678
6679                                                            - Must happen before
6680                                                              the following buffer_invl2 and
6681                                                              buffer_wbinvl1_vol.
6682                                                            - Ensures the
6683                                                              atomicrmw has
6684                                                              completed before
6685                                                              invalidating the
6686                                                              caches.
6687
6688                                                          3. buffer_invl2;
6689                                                             buffer_wbinvl1_vol
6690
6691                                                            - Must happen before
6692                                                              any following
6693                                                              global/generic
6694                                                              load/load
6695                                                              atomic/atomicrmw.
6696                                                            - Ensures that
6697                                                              following
6698                                                              loads will not see
6699                                                              stale L1 global data,
6700                                                              nor see stale L2 MTYPE
6701                                                              NC global data.
6702                                                              MTYPE RW and CC memory will
6703                                                              never be stale in L2 due to
6704                                                              the memory probes.
6705
6706      atomicrmw    acquire      - agent        - generic  1. flat_atomic
6707                                                          2. s_waitcnt vmcnt(0) &
6708                                                             lgkmcnt(0)
6709
6710                                                            - If TgSplit execution mode,
6711                                                              omit lgkmcnt(0).
6712                                                            - If OpenCL, omit
6713                                                              lgkmcnt(0).
6714                                                            - Must happen before
6715                                                              the following
6716                                                              buffer_wbinvl1_vol.
6717                                                            - Ensures the
6718                                                              atomicrmw has
6719                                                              completed before
6720                                                              invalidating the
6721                                                              cache.
6722
6723                                                          3. buffer_wbinvl1_vol
6724
6725                                                            - Must happen before
6726                                                              any following
6727                                                              global/generic
6728                                                              load/load
6729                                                              atomic/atomicrmw.
6730                                                            - Ensures that
6731                                                              following loads
6732                                                              will not see stale
6733                                                              global data.
6734
6735      atomicrmw    acquire      - system       - generic  1. flat_atomic
6736                                                          2. s_waitcnt vmcnt(0) &
6737                                                             lgkmcnt(0)
6738
6739                                                            - If TgSplit execution mode,
6740                                                              omit lgkmcnt(0).
6741                                                            - If OpenCL, omit
6742                                                              lgkmcnt(0).
6743                                                            - Must happen before
6744                                                              the following
6745                                                              buffer_invl2 and
6746                                                              buffer_wbinvl1_vol.
6747                                                            - Ensures the
6748                                                              atomicrmw has
6749                                                              completed before
6750                                                              invalidating the
6751                                                              caches.
6752
6753                                                          3. buffer_invl2;
6754                                                             buffer_wbinvl1_vol
6755
6756                                                            - Must happen before
6757                                                              any following
6758                                                              global/generic
6759                                                              load/load
6760                                                              atomic/atomicrmw.
6761                                                            - Ensures that
6762                                                              following
6763                                                              loads will not see
6764                                                              stale L1 global data,
6765                                                              nor see stale L2 MTYPE
6766                                                              NC global data.
6767                                                              MTYPE RW and CC memory will
6768                                                              never be stale in L2 due to
6769                                                              the memory probes.
6770
6771      fence        acquire      - singlethread *none*     *none*
6772                                - wavefront
6773      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
6774
6775                                                            - Use lgkmcnt(0) if not
6776                                                              TgSplit execution mode
6777                                                              and vmcnt(0) if TgSplit
6778                                                              execution mode.
6779                                                            - If OpenCL and
6780                                                              address space is
6781                                                              not generic, omit
6782                                                              lgkmcnt(0).
6783                                                            - If OpenCL and
6784                                                              address space is
6785                                                              local, omit
6786                                                              vmcnt(0).
6787                                                            - However, since LLVM
6788                                                              currently has no
6789                                                              address space on
6790                                                              the fence, the
6791                                                              s_waitcnt must
6792                                                              conservatively
6793                                                              always be generated.
6794                                                              If the fence had an
6795                                                              address space, it
6796                                                              would be set to the
6797                                                              address space of the
6798                                                              OpenCL fence flag,
6799                                                              or to generic if both
6800                                                              local and global
6801                                                              flags are specified.
6802                                                            - s_waitcnt vmcnt(0)
6803                                                              must happen after
6804                                                              any preceding
6805                                                              global/generic load
6806                                                              atomic/
6807                                                              atomicrmw
6808                                                              with an equal or
6809                                                              wider sync scope
6810                                                              and memory ordering
6811                                                              stronger than
6812                                                              unordered (this is
6813                                                              termed the
6814                                                              fence-paired-atomic).
6815                                                            - s_waitcnt lgkmcnt(0)
6816                                                              must happen after
6817                                                              any preceding
6818                                                              local/generic load
6819                                                              atomic/atomicrmw
6820                                                              with an equal or
6821                                                              wider sync scope
6822                                                              and memory ordering
6823                                                              stronger than
6824                                                              unordered (this is
6825                                                              termed the
6826                                                              fence-paired-atomic).
6827                                                            - Must happen before
6828                                                              the following
6829                                                              buffer_wbinvl1_vol and
6830                                                              any following
6831                                                              global/generic
6832                                                              load/load
6833                                                              atomic/store/store
6834                                                              atomic/atomicrmw.
6835                                                            - Ensures any
6836                                                              following global
6837                                                              data read is no
6838                                                              older than the
6839                                                              value read by the
6840                                                              fence-paired-atomic.
6841
6842                                                          2. buffer_wbinvl1_vol
6843
6844                                                            - If not TgSplit execution
6845                                                              mode, omit.
6846                                                            - Ensures that
6847                                                              following
6848                                                              loads will not see
6849                                                              stale data.
6850
6851      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
6852                                                             vmcnt(0)
6853
6854                                                            - If TgSplit execution mode,
6855                                                              omit lgkmcnt(0).
6856                                                            - If OpenCL and
6857                                                              address space is
6858                                                              not generic, omit
6859                                                              lgkmcnt(0).
6860                                                            - However, since LLVM
6861                                                              currently has no
6862                                                              address space on
6863                                                              the fence, the
6864                                                              s_waitcnt must
6865                                                              conservatively always
6866                                                              be generated (see comment
6867                                                              for previous fence).
6868                                                            - Could be split into
6869                                                              separate s_waitcnt
6870                                                              vmcnt(0) and
6871                                                              s_waitcnt
6872                                                              lgkmcnt(0) to allow
6873                                                              them to be
6874                                                              independently moved
6875                                                              according to the
6876                                                              following rules.
6877                                                            - s_waitcnt vmcnt(0)
6878                                                              must happen after
6879                                                              any preceding
6880                                                              global/generic load
6881                                                              atomic/atomicrmw
6882                                                              with an equal or
6883                                                              wider sync scope
6884                                                              and memory ordering
6885                                                              stronger than
6886                                                              unordered (this is
6887                                                              termed the
6888                                                              fence-paired-atomic).
6889                                                            - s_waitcnt lgkmcnt(0)
6890                                                              must happen after
6891                                                              any preceding
6892                                                              local/generic load
6893                                                              atomic/atomicrmw
6894                                                              with an equal or
6895                                                              wider sync scope
6896                                                              and memory ordering
6897                                                              stronger than
6898                                                              unordered (this is
6899                                                              termed the
6900                                                              fence-paired-atomic).
6901                                                            - Must happen before
6902                                                              the following
6903                                                              buffer_wbinvl1_vol.
6904                                                            - Ensures that the
6905                                                              fence-paired-atomic
6906                                                              has completed
6907                                                              before invalidating
6908                                                              the
6909                                                              cache. Therefore
6910                                                              any following
6911                                                              locations read must
6912                                                              be no older than
6913                                                              the value read by
6914                                                              the
6915                                                              fence-paired-atomic.
6916
6917                                                          2. buffer_wbinvl1_vol
6918
6919                                                            - Must happen before any
6920                                                              following global/generic
6921                                                              load/load
6922                                                              atomic/store/store
6923                                                              atomic/atomicrmw.
6924                                                            - Ensures that
6925                                                              following loads
6926                                                              will not see stale
6927                                                              global data.
6928
6929      fence        acquire      - system       *none*     1. s_waitcnt lgkmcnt(0) &
6930                                                             vmcnt(0)
6931
6932                                                            - If TgSplit execution mode,
6933                                                              omit lgkmcnt(0).
6934                                                            - If OpenCL and
6935                                                              address space is
6936                                                              not generic, omit
6937                                                              lgkmcnt(0).
6938                                                            - However, since LLVM
6939                                                              currently has no
6940                                                              address space on
6941                                                              the fence, the
6942                                                              s_waitcnt must
6943                                                              conservatively always
6944                                                              be generated (see comment
6945                                                              for previous fence).
6946                                                            - Could be split into
6947                                                              separate s_waitcnt
6948                                                              vmcnt(0) and
6949                                                              s_waitcnt
6950                                                              lgkmcnt(0) to allow
6951                                                              them to be
6952                                                              independently moved
6953                                                              according to the
6954                                                              following rules.
6955                                                            - s_waitcnt vmcnt(0)
6956                                                              must happen after
6957                                                              any preceding
6958                                                              global/generic load
6959                                                              atomic/atomicrmw
6960                                                              with an equal or
6961                                                              wider sync scope
6962                                                              and memory ordering
6963                                                              stronger than
6964                                                              unordered (this is
6965                                                              termed the
6966                                                              fence-paired-atomic).
6967                                                            - s_waitcnt lgkmcnt(0)
6968                                                              must happen after
6969                                                              any preceding
6970                                                              local/generic load
6971                                                              atomic/atomicrmw
6972                                                              with an equal or
6973                                                              wider sync scope
6974                                                              and memory ordering
6975                                                              stronger than
6976                                                              unordered (this is
6977                                                              termed the
6978                                                              fence-paired-atomic).
6979                                                            - Must happen before
6980                                                              the following buffer_invl2 and
6981                                                              buffer_wbinvl1_vol.
6982                                                            - Ensures that the
6983                                                              fence-paired atomic
6984                                                              has completed
6985                                                              before invalidating
6986                                                              the
6987                                                              cache. Therefore
6988                                                              any following
6989                                                              locations read must
6990                                                              be no older than
6991                                                              the value read by
6992                                                              the
6993                                                              fence-paired-atomic.
6994
6995                                                          2. buffer_invl2;
6996                                                             buffer_wbinvl1_vol
6997
6998                                                            - Must happen before any
6999                                                              following global/generic
7000                                                              load/load
7001                                                              atomic/store/store
7002                                                              atomic/atomicrmw.
7003                                                            - Ensures that
7004                                                              following
7005                                                              loads will not see
7006                                                              stale L1 global data,
7007                                                              nor see stale L2 MTYPE
7008                                                              NC global data.
7009                                                              MTYPE RW and CC memory will
7010                                                              never be stale in L2 due to
7011                                                              the memory probes.
7012      **Release Atomic**
7013      ------------------------------------------------------------------------------------
7014      store atomic release      - singlethread - global   1. buffer/global/flat_store
7015                                - wavefront    - generic
7016      store atomic release      - singlethread - local    *If TgSplit execution mode,
7017                                - wavefront               local address space cannot
7018                                                          be used.*
7019
7020                                                          1. ds_store
7021      store atomic release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7022                                               - generic
7023                                                            - Use lgkmcnt(0) if not
7024                                                              TgSplit execution mode
7025                                                              and vmcnt(0) if TgSplit
7026                                                              execution mode.
7027                                                            - If OpenCL, omit lgkmcnt(0).
7028                                                            - s_waitcnt vmcnt(0)
7029                                                              must happen after
7030                                                              any preceding
7031                                                              global/generic load/store/
7032                                                              load atomic/store atomic/
7033                                                              atomicrmw.
7034                                                            - s_waitcnt lgkmcnt(0)
7035                                                              must happen after
7036                                                              any preceding
7037                                                              local/generic
7038                                                              load/store/load
7039                                                              atomic/store
7040                                                              atomic/atomicrmw.
7041                                                            - Must happen before
7042                                                              the following
7043                                                              store.
7044                                                            - Ensures that all
7045                                                              memory operations
7046                                                              have
7047                                                              completed before
7048                                                              performing the
7049                                                              store that is being
7050                                                              released.
7051
7052                                                          2. buffer/global/flat_store
7053      store atomic release      - workgroup    - local    *If TgSplit execution mode,
7054                                                          local address space cannot
7055                                                          be used.*
7056
7057                                                          1. ds_store
7058      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7059                                               - generic     vmcnt(0)
7060
7061                                                            - If TgSplit execution mode,
7062                                                              omit lgkmcnt(0).
7063                                                            - If OpenCL and
7064                                                              address space is
7065                                                              not generic, omit
7066                                                              lgkmcnt(0).
7067                                                            - Could be split into
7068                                                              separate s_waitcnt
7069                                                              vmcnt(0) and
7070                                                              s_waitcnt
7071                                                              lgkmcnt(0) to allow
7072                                                              them to be
7073                                                              independently moved
7074                                                              according to the
7075                                                              following rules.
7076                                                            - s_waitcnt vmcnt(0)
7077                                                              must happen after
7078                                                              any preceding
7079                                                              global/generic
7080                                                              load/store/load
7081                                                              atomic/store
7082                                                              atomic/atomicrmw.
7083                                                            - s_waitcnt lgkmcnt(0)
7084                                                              must happen after
7085                                                              any preceding
7086                                                              local/generic
7087                                                              load/store/load
7088                                                              atomic/store
7089                                                              atomic/atomicrmw.
7090                                                            - Must happen before
7091                                                              the following
7092                                                              store.
7093                                                            - Ensures that all
7094                                                              memory operations
7095                                                              to memory have
7096                                                              completed before
7097                                                              performing the
7098                                                              store that is being
7099                                                              released.
7100
7101                                                          2. buffer/global/flat_store
7102      store atomic release      - system       - global   1. buffer_wbl2
7103                                               - generic
7104                                                            - Must happen before
7105                                                              following s_waitcnt.
7106                                                            - Performs L2 writeback to
7107                                                              ensure previous
7108                                                              global/generic
7109                                                              store/atomicrmw are
7110                                                              visible at system scope.
7111
7112                                                          2. s_waitcnt lgkmcnt(0) &
7113                                                             vmcnt(0)
7114
7115                                                            - If TgSplit execution mode,
7116                                                              omit lgkmcnt(0).
7117                                                            - If OpenCL and
7118                                                              address space is
7119                                                              not generic, omit
7120                                                              lgkmcnt(0).
7121                                                            - Could be split into
7122                                                              separate s_waitcnt
7123                                                              vmcnt(0) and
7124                                                              s_waitcnt
7125                                                              lgkmcnt(0) to allow
7126                                                              them to be
7127                                                              independently moved
7128                                                              according to the
7129                                                              following rules.
7130                                                            - s_waitcnt vmcnt(0)
7131                                                              must happen after any
7132                                                              preceding
7133                                                              global/generic
7134                                                              load/store/load
7135                                                              atomic/store
7136                                                              atomic/atomicrmw.
7137                                                            - s_waitcnt lgkmcnt(0)
7138                                                              must happen after any
7139                                                              preceding
7140                                                              local/generic
7141                                                              load/store/load
7142                                                              atomic/store
7143                                                              atomic/atomicrmw.
7144                                                            - Must happen before
7145                                                              the following
7146                                                              store.
7147                                                            - Ensures that all
7148                                                              memory operations
7149                                                              to memory and the L2
7150                                                              writeback have
7151                                                              completed before
7152                                                              performing the
7153                                                              store that is being
7154                                                              released.
7155
7156                                                          3. buffer/global/flat_store
7157      atomicrmw    release      - singlethread - global   1. buffer/global/flat_atomic
7158                                - wavefront    - generic
7159      atomicrmw    release      - singlethread - local    *If TgSplit execution mode,
7160                                - wavefront               local address space cannot
7161                                                          be used.*
7162
7163                                                          1. ds_atomic
7164      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7165                                               - generic
7166                                                            - Use lgkmcnt(0) if not
7167                                                              TgSplit execution mode
7168                                                              and vmcnt(0) if TgSplit
7169                                                              execution mode.
7170                                                            - If OpenCL, omit
7171                                                              lgkmcnt(0).
7172                                                            - s_waitcnt vmcnt(0)
7173                                                              must happen after
7174                                                              any preceding
7175                                                              global/generic load/store/
7176                                                              load atomic/store atomic/
7177                                                              atomicrmw.
7178                                                            - s_waitcnt lgkmcnt(0)
7179                                                              must happen after
7180                                                              any preceding
7181                                                              local/generic
7182                                                              load/store/load
7183                                                              atomic/store
7184                                                              atomic/atomicrmw.
7185                                                            - Must happen before
7186                                                              the following
7187                                                              atomicrmw.
7188                                                            - Ensures that all
7189                                                              memory operations
7190                                                              have
7191                                                              completed before
7192                                                              performing the
7193                                                              atomicrmw that is
7194                                                              being released.
7195
7196                                                          2. buffer/global/flat_atomic
7197      atomicrmw    release      - workgroup    - local    *If TgSplit execution mode,
7198                                                          local address space cannot
7199                                                          be used.*
7200
7201                                                          1. ds_atomic
7202      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7203                                               - generic     vmcnt(0)
7204
7205                                                            - If TgSplit execution mode,
7206                                                              omit lgkmcnt(0).
7207                                                            - If OpenCL, omit
7208                                                              lgkmcnt(0).
7209                                                            - Could be split into
7210                                                              separate s_waitcnt
7211                                                              vmcnt(0) and
7212                                                              s_waitcnt
7213                                                              lgkmcnt(0) to allow
7214                                                              them to be
7215                                                              independently moved
7216                                                              according to the
7217                                                              following rules.
7218                                                            - s_waitcnt vmcnt(0)
7219                                                              must happen after
7220                                                              any preceding
7221                                                              global/generic
7222                                                              load/store/load
7223                                                              atomic/store
7224                                                              atomic/atomicrmw.
7225                                                            - s_waitcnt lgkmcnt(0)
7226                                                              must happen after
7227                                                              any preceding
7228                                                              local/generic
7229                                                              load/store/load
7230                                                              atomic/store
7231                                                              atomic/atomicrmw.
7232                                                            - Must happen before
7233                                                              the following
7234                                                              atomicrmw.
7235                                                            - Ensures that all
7236                                                              memory operations
7237                                                              to global and local
7238                                                              have completed
7239                                                              before performing
7240                                                              the atomicrmw that
7241                                                              is being released.
7242
7243                                                          2. buffer/global/flat_atomic
7244      atomicrmw    release      - system       - global   1. buffer_wbl2
7245                                               - generic
7246                                                            - Must happen before
7247                                                              following s_waitcnt.
7248                                                            - Performs L2 writeback to
7249                                                              ensure previous
7250                                                              global/generic
7251                                                              store/atomicrmw are
7252                                                              visible at system scope.
7253
7254                                                          2. s_waitcnt lgkmcnt(0) &
7255                                                             vmcnt(0)
7256
7257                                                            - If TgSplit execution mode,
7258                                                              omit lgkmcnt(0).
7259                                                            - If OpenCL, omit
7260                                                              lgkmcnt(0).
7261                                                            - Could be split into
7262                                                              separate s_waitcnt
7263                                                              vmcnt(0) and
7264                                                              s_waitcnt
7265                                                              lgkmcnt(0) to allow
7266                                                              them to be
7267                                                              independently moved
7268                                                              according to the
7269                                                              following rules.
7270                                                            - s_waitcnt vmcnt(0)
7271                                                              must happen after
7272                                                              any preceding
7273                                                              global/generic
7274                                                              load/store/load
7275                                                              atomic/store
7276                                                              atomic/atomicrmw.
7277                                                            - s_waitcnt lgkmcnt(0)
7278                                                              must happen after
7279                                                              any preceding
7280                                                              local/generic
7281                                                              load/store/load
7282                                                              atomic/store
7283                                                              atomic/atomicrmw.
7284                                                            - Must happen before
7285                                                              the following
7286                                                              atomicrmw.
7287                                                            - Ensures that all
7288                                                              memory operations
7289                                                              to memory and the L2
7290                                                              writeback have
7291                                                              completed before
7292                                                              performing the
7293                                                              atomicrmw that is
7294                                                              being released.
7295
7296                                                          3. buffer/global/flat_atomic
7297      fence        release      - singlethread *none*     *none*
7298                                - wavefront
7299      fence        release      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7300
7301                                                            - Use lgkmcnt(0) if not
7302                                                              TgSplit execution mode
7303                                                              and vmcnt(0) if TgSplit
7304                                                              execution mode.
7305                                                            - If OpenCL and
7306                                                              address space is
7307                                                              not generic, omit
7308                                                              lgkmcnt(0).
7309                                                            - If OpenCL and
7310                                                              address space is
7311                                                              local, omit
7312                                                              vmcnt(0).
7313                                                            - However, since LLVM
7314                                                              currently has no
7315                                                              address space on
7316                                                              the fence, the wait
7317                                                              must conservatively
7318                                                              always be generated.
7319                                                              If the fence had an
7320                                                              address space, it would
7321                                                              be set to the address
7322                                                              space of OpenCL
7323                                                              fence flag, or to
7324                                                              generic if both
7325                                                              local and global
7326                                                              flags are
7327                                                              specified.
7328                                                            - s_waitcnt vmcnt(0)
7329                                                              must happen after
7330                                                              any preceding
7331                                                              global/generic
7332                                                              load/store/
7333                                                              load atomic/store atomic/
7334                                                              atomicrmw.
7335                                                            - s_waitcnt lgkmcnt(0)
7336                                                              must happen after
7337                                                              any preceding
7338                                                              local/generic
7339                                                              load/load
7340                                                              atomic/store/store
7341                                                              atomic/atomicrmw.
7342                                                            - Must happen before
7343                                                              any following store
7344                                                              atomic/atomicrmw
7345                                                              with an equal or
7346                                                              wider sync scope
7347                                                              and memory ordering
7348                                                              stronger than
7349                                                              unordered (this is
7350                                                              termed the
7351                                                              fence-paired-atomic).
7352                                                            - Ensures that all
7353                                                              memory operations
7354                                                              have
7355                                                              completed before
7356                                                              performing the
7357                                                              following
7358                                                              fence-paired-atomic.
7359
7360      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
7361                                                             vmcnt(0)
7362
7363                                                            - If TgSplit execution mode,
7364                                                              omit lgkmcnt(0).
7365                                                            - If OpenCL and
7366                                                              address space is
7367                                                              not generic, omit
7368                                                              lgkmcnt(0).
7369                                                            - If OpenCL and
7370                                                              address space is
7371                                                              local, omit
7372                                                              vmcnt(0).
7373                                                            - However, since LLVM
7374                                                              currently has no
7375                                                              address space on
7376                                                              the fence, both must
7377                                                              conservatively
7378                                                              always be generated. If
7379                                                              fence had an
7380                                                              address space then
7381                                                              set to address
7382                                                              space of OpenCL
7383                                                              fence flag, or to
7384                                                              generic if both
7385                                                              local and global
7386                                                              flags are
7387                                                              specified.
7388                                                            - Could be split into
7389                                                              separate s_waitcnt
7390                                                              vmcnt(0) and
7391                                                              s_waitcnt
7392                                                              lgkmcnt(0) to allow
7393                                                              them to be
7394                                                              independently moved
7395                                                              according to the
7396                                                              following rules.
7397                                                            - s_waitcnt vmcnt(0)
7398                                                              must happen after
7399                                                              any preceding
7400                                                              global/generic
7401                                                              load/store/load
7402                                                              atomic/store
7403                                                              atomic/atomicrmw.
7404                                                            - s_waitcnt lgkmcnt(0)
7405                                                              must happen after
7406                                                              any preceding
7407                                                              local/generic
7408                                                              load/store/load
7409                                                              atomic/store
7410                                                              atomic/atomicrmw.
7411                                                            - Must happen before
7412                                                              any following store
7413                                                              atomic/atomicrmw
7414                                                              with an equal or
7415                                                              wider sync scope
7416                                                              and memory ordering
7417                                                              stronger than
7418                                                              unordered (this is
7419                                                              termed the
7420                                                              fence-paired-atomic).
7421                                                            - Ensures that all
7422                                                              memory operations
7423                                                              have
7424                                                              completed before
7425                                                              performing the
7426                                                              following
7427                                                              fence-paired-atomic.
7428
7429      fence        release      - system       *none*     1. buffer_wbl2
7430
7431                                                            - If OpenCL and
7432                                                              address space is
7433                                                              local, omit.
7434                                                            - Must happen before
7435                                                              following s_waitcnt.
7436                                                            - Performs L2 writeback to
7437                                                              ensure previous
7438                                                              global/generic
7439                                                              store/atomicrmw are
7440                                                              visible at system scope.
7441
7442                                                          2. s_waitcnt lgkmcnt(0) &
7443                                                             vmcnt(0)
7444
7445                                                            - If TgSplit execution mode,
7446                                                              omit lgkmcnt(0).
7447                                                            - If OpenCL and
7448                                                              address space is
7449                                                              not generic, omit
7450                                                              lgkmcnt(0).
7451                                                            - If OpenCL and
7452                                                              address space is
7453                                                              local, omit
7454                                                              vmcnt(0).
7455                                                            - However, since LLVM
7456                                                              currently has no
7457                                                              address space on
7458                                                              the fence, both must
7459                                                              conservatively
7460                                                              always be generated. If
7461                                                              fence had an
7462                                                              address space then
7463                                                              set to address
7464                                                              space of OpenCL
7465                                                              fence flag, or to
7466                                                              generic if both
7467                                                              local and global
7468                                                              flags are
7469                                                              specified.
7470                                                            - Could be split into
7471                                                              separate s_waitcnt
7472                                                              vmcnt(0) and
7473                                                              s_waitcnt
7474                                                              lgkmcnt(0) to allow
7475                                                              them to be
7476                                                              independently moved
7477                                                              according to the
7478                                                              following rules.
7479                                                            - s_waitcnt vmcnt(0)
7480                                                              must happen after
7481                                                              any preceding
7482                                                              global/generic
7483                                                              load/store/load
7484                                                              atomic/store
7485                                                              atomic/atomicrmw.
7486                                                            - s_waitcnt lgkmcnt(0)
7487                                                              must happen after
7488                                                              any preceding
7489                                                              local/generic
7490                                                              load/store/load
7491                                                              atomic/store
7492                                                              atomic/atomicrmw.
7493                                                            - Must happen before
7494                                                              any following store
7495                                                              atomic/atomicrmw
7496                                                              with an equal or
7497                                                              wider sync scope
7498                                                              and memory ordering
7499                                                              stronger than
7500                                                              unordered (this is
7501                                                              termed the
7502                                                              fence-paired-atomic).
7503                                                            - Ensures that all
7504                                                              memory operations
7505                                                              have
7506                                                              completed before
7507                                                              performing the
7508                                                              following
7509                                                              fence-paired-atomic.
7510
7511      **Acquire-Release Atomic**
7512      ------------------------------------------------------------------------------------
7513      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/flat_atomic
7514                                - wavefront    - generic
7515      atomicrmw    acq_rel      - singlethread - local    *If TgSplit execution mode,
7516                                - wavefront               local address space cannot
7517                                                          be used.*
7518
7519                                                          1. ds_atomic
7520      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
7521
7522                                                            - Use lgkmcnt(0) if not
7523                                                              TgSplit execution mode
7524                                                              and vmcnt(0) if TgSplit
7525                                                              execution mode.
7526                                                            - If OpenCL, omit
7527                                                              lgkmcnt(0).
7534                                                            - s_waitcnt vmcnt(0)
7535                                                              must happen after
7536                                                              any preceding
7537                                                              global/generic load/store/
7538                                                              load atomic/store atomic/
7539                                                              atomicrmw.
7540                                                            - s_waitcnt lgkmcnt(0)
7541                                                              must happen after
7542                                                              any preceding
7543                                                              local/generic
7544                                                              load/store/load
7545                                                              atomic/store
7546                                                              atomic/atomicrmw.
7547                                                            - Must happen before
7548                                                              the following
7549                                                              atomicrmw.
7550                                                            - Ensures that all
7551                                                              memory operations
7552                                                              have
7553                                                              completed before
7554                                                              performing the
7555                                                              atomicrmw that is
7556                                                              being released.
7557
7558                                                          2. buffer/global_atomic
7559                                                          3. s_waitcnt vmcnt(0)
7560
7561                                                            - If not TgSplit execution
7562                                                              mode, omit.
7563                                                            - Must happen before
7564                                                              the following
7565                                                              buffer_wbinvl1_vol.
7566                                                            - Ensures any
7567                                                              following global
7568                                                              data read is no
7569                                                              older than the
7570                                                              atomicrmw value
7571                                                              being acquired.
7572
7573                                                          4. buffer_wbinvl1_vol
7574
7575                                                            - If not TgSplit execution
7576                                                              mode, omit.
7577                                                            - Ensures that
7578                                                              following
7579                                                              loads will not see
7580                                                              stale data.
7581
7582      atomicrmw    acq_rel      - workgroup    - local    *If TgSplit execution mode,
7583                                                          local address space cannot
7584                                                          be used.*
7585
7586                                                          1. ds_atomic
7587                                                          2. s_waitcnt lgkmcnt(0)
7588
7589                                                            - If OpenCL, omit.
7590                                                            - Must happen before
7591                                                              any following
7592                                                              global/generic
7593                                                              load/load
7594                                                              atomic/store/store
7595                                                              atomic/atomicrmw.
7596                                                            - Ensures any
7597                                                              following global
7598                                                              data read is no
7599                                                              older than the local
7600                                                              atomicrmw value being
7601                                                              acquired.
7602
7603      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkm/vmcnt(0)
7604
7605                                                            - Use lgkmcnt(0) if not
7606                                                              TgSplit execution mode
7607                                                              and vmcnt(0) if TgSplit
7608                                                              execution mode.
7609                                                            - If OpenCL, omit
7610                                                              lgkmcnt(0).
7611                                                            - s_waitcnt vmcnt(0)
7612                                                              must happen after
7613                                                              any preceding
7614                                                              global/generic load/store/
7615                                                              load atomic/store atomic/
7616                                                              atomicrmw.
7617                                                            - s_waitcnt lgkmcnt(0)
7618                                                              must happen after
7619                                                              any preceding
7620                                                              local/generic
7621                                                              load/store/load
7622                                                              atomic/store
7623                                                              atomic/atomicrmw.
7624                                                            - Must happen before
7625                                                              the following
7626                                                              atomicrmw.
7627                                                            - Ensures that all
7628                                                              memory operations
7629                                                              have
7630                                                              completed before
7631                                                              performing the
7632                                                              atomicrmw that is
7633                                                              being released.
7634
7635                                                          2. flat_atomic
7636                                                          3. s_waitcnt lgkmcnt(0) &
7637                                                             vmcnt(0)
7638
7639                                                            - If not TgSplit execution
7640                                                              mode, omit vmcnt(0).
7641                                                            - If OpenCL, omit
7642                                                              lgkmcnt(0).
7643                                                            - Must happen before
7644                                                              the following
7645                                                              buffer_wbinvl1_vol and
7646                                                              any following
7647                                                              global/generic
7648                                                              load/load
7649                                                              atomic/store/store
7650                                                              atomic/atomicrmw.
7651                                                            - Ensures any
7652                                                              following global
7653                                                              data read is no
7654                                                              older than a local
7655                                                              atomicrmw value being
7656                                                              acquired.
7657
7658                                                          4. buffer_wbinvl1_vol
7659
7660                                                            - If not TgSplit execution
7661                                                              mode, omit.
7662                                                            - Ensures that
7663                                                              following
7664                                                              loads will not see
7665                                                              stale data.
7666
7667      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
7668                                                             vmcnt(0)
7669
7670                                                            - If TgSplit execution mode,
7671                                                              omit lgkmcnt(0).
7672                                                            - If OpenCL, omit
7673                                                              lgkmcnt(0).
7674                                                            - Could be split into
7675                                                              separate s_waitcnt
7676                                                              vmcnt(0) and
7677                                                              s_waitcnt
7678                                                              lgkmcnt(0) to allow
7679                                                              them to be
7680                                                              independently moved
7681                                                              according to the
7682                                                              following rules.
7683                                                            - s_waitcnt vmcnt(0)
7684                                                              must happen after
7685                                                              any preceding
7686                                                              global/generic
7687                                                              load/store/load
7688                                                              atomic/store
7689                                                              atomic/atomicrmw.
7690                                                            - s_waitcnt lgkmcnt(0)
7691                                                              must happen after
7692                                                              any preceding
7693                                                              local/generic
7694                                                              load/store/load
7695                                                              atomic/store
7696                                                              atomic/atomicrmw.
7697                                                            - Must happen before
7698                                                              the following
7699                                                              atomicrmw.
7700                                                            - Ensures that all
7701                                                              memory operations
7702                                                              to global have
7703                                                              completed before
7704                                                              performing the
7705                                                              atomicrmw that is
7706                                                              being released.
7707
7708                                                          2. buffer/global_atomic
7709                                                          3. s_waitcnt vmcnt(0)
7710
7711                                                            - Must happen before
7712                                                              following
7713                                                              buffer_wbinvl1_vol.
7714                                                            - Ensures the
7715                                                              atomicrmw has
7716                                                              completed before
7717                                                              invalidating the
7718                                                              cache.
7719
7720                                                          4. buffer_wbinvl1_vol
7721
7722                                                            - Must happen before
7723                                                              any following
7724                                                              global/generic
7725                                                              load/load
7726                                                              atomic/atomicrmw.
7727                                                            - Ensures that
7728                                                              following loads
7729                                                              will not see stale
7730                                                              global data.
7731
7732      atomicrmw    acq_rel      - system       - global   1. buffer_wbl2
7733
7734                                                            - Must happen before
7735                                                              following s_waitcnt.
7736                                                            - Performs L2 writeback to
7737                                                              ensure previous
7738                                                              global/generic
7739                                                              store/atomicrmw are
7740                                                              visible at system scope.
7741
7742                                                          2. s_waitcnt lgkmcnt(0) &
7743                                                             vmcnt(0)
7744
7745                                                            - If TgSplit execution mode,
7746                                                              omit lgkmcnt(0).
7747                                                            - If OpenCL, omit
7748                                                              lgkmcnt(0).
7749                                                            - Could be split into
7750                                                              separate s_waitcnt
7751                                                              vmcnt(0) and
7752                                                              s_waitcnt
7753                                                              lgkmcnt(0) to allow
7754                                                              them to be
7755                                                              independently moved
7756                                                              according to the
7757                                                              following rules.
7758                                                            - s_waitcnt vmcnt(0)
7759                                                              must happen after
7760                                                              any preceding
7761                                                              global/generic
7762                                                              load/store/load
7763                                                              atomic/store
7764                                                              atomic/atomicrmw.
7765                                                            - s_waitcnt lgkmcnt(0)
7766                                                              must happen after
7767                                                              any preceding
7768                                                              local/generic
7769                                                              load/store/load
7770                                                              atomic/store
7771                                                              atomic/atomicrmw.
7772                                                            - Must happen before
7773                                                              the following
7774                                                              atomicrmw.
7775                                                            - Ensures that all
7776                                                              memory operations
7777                                                              to global and L2 writeback
7778                                                              have completed before
7779                                                              performing the
7780                                                              atomicrmw that is
7781                                                              being released.
7782
7783                                                          3. buffer/global_atomic
7784                                                          4. s_waitcnt vmcnt(0)
7785
7786                                                            - Must happen before
7787                                                              following buffer_invl2 and
7788                                                              buffer_wbinvl1_vol.
7789                                                            - Ensures the
7790                                                              atomicrmw has
7791                                                              completed before
7792                                                              invalidating the
7793                                                              caches.
7794
7795                                                          5. buffer_invl2;
7796                                                             buffer_wbinvl1_vol
7797
7798                                                            - Must happen before
7799                                                              any following
7800                                                              global/generic
7801                                                              load/load
7802                                                              atomic/atomicrmw.
7803                                                            - Ensures that
7804                                                              following
7805                                                              loads will not see
7806                                                              stale L1 global data,
7807                                                              nor see stale L2 MTYPE
7808                                                              NC global data.
7809                                                              MTYPE RW and CC memory will
7810                                                              never be stale in L2 due to
7811                                                              the memory probes.
7812
7813      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
7814                                                             vmcnt(0)
7815
7816                                                            - If TgSplit execution mode,
7817                                                              omit lgkmcnt(0).
7818                                                            - If OpenCL, omit
7819                                                              lgkmcnt(0).
7820                                                            - Could be split into
7821                                                              separate s_waitcnt
7822                                                              vmcnt(0) and
7823                                                              s_waitcnt
7824                                                              lgkmcnt(0) to allow
7825                                                              them to be
7826                                                              independently moved
7827                                                              according to the
7828                                                              following rules.
7829                                                            - s_waitcnt vmcnt(0)
7830                                                              must happen after
7831                                                              any preceding
7832                                                              global/generic
7833                                                              load/store/load
7834                                                              atomic/store
7835                                                              atomic/atomicrmw.
7836                                                            - s_waitcnt lgkmcnt(0)
7837                                                              must happen after
7838                                                              any preceding
7839                                                              local/generic
7840                                                              load/store/load
7841                                                              atomic/store
7842                                                              atomic/atomicrmw.
7843                                                            - Must happen before
7844                                                              the following
7845                                                              atomicrmw.
7846                                                            - Ensures that all
7847                                                              memory operations
7848                                                              to global have
7849                                                              completed before
7850                                                              performing the
7851                                                              atomicrmw that is
7852                                                              being released.
7853
7854                                                          2. flat_atomic
7855                                                          3. s_waitcnt vmcnt(0) &
7856                                                             lgkmcnt(0)
7857
7858                                                            - If TgSplit execution mode,
7859                                                              omit lgkmcnt(0).
7860                                                            - If OpenCL, omit
7861                                                              lgkmcnt(0).
7862                                                            - Must happen before
7863                                                              following
7864                                                              buffer_wbinvl1_vol.
7865                                                            - Ensures the
7866                                                              atomicrmw has
7867                                                              completed before
7868                                                              invalidating the
7869                                                              cache.
7870
7871                                                          4. buffer_wbinvl1_vol
7872
7873                                                            - Must happen before
7874                                                              any following
7875                                                              global/generic
7876                                                              load/load
7877                                                              atomic/atomicrmw.
7878                                                            - Ensures that
7879                                                              following loads
7880                                                              will not see stale
7881                                                              global data.
7882
7883      atomicrmw    acq_rel      - system       - generic  1. buffer_wbl2
7884
7885                                                            - Must happen before
7886                                                              following s_waitcnt.
7887                                                            - Performs L2 writeback to
7888                                                              ensure previous
7889                                                              global/generic
7890                                                              store/atomicrmw are
7891                                                              visible at system scope.
7892
7893                                                          2. s_waitcnt lgkmcnt(0) &
7894                                                             vmcnt(0)
7895
7896                                                            - If TgSplit execution mode,
7897                                                              omit lgkmcnt(0).
7898                                                            - If OpenCL, omit
7899                                                              lgkmcnt(0).
7900                                                            - Could be split into
7901                                                              separate s_waitcnt
7902                                                              vmcnt(0) and
7903                                                              s_waitcnt
7904                                                              lgkmcnt(0) to allow
7905                                                              them to be
7906                                                              independently moved
7907                                                              according to the
7908                                                              following rules.
7909                                                            - s_waitcnt vmcnt(0)
7910                                                              must happen after
7911                                                              any preceding
7912                                                              global/generic
7913                                                              load/store/load
7914                                                              atomic/store
7915                                                              atomic/atomicrmw.
7916                                                            - s_waitcnt lgkmcnt(0)
7917                                                              must happen after
7918                                                              any preceding
7919                                                              local/generic
7920                                                              load/store/load
7921                                                              atomic/store
7922                                                              atomic/atomicrmw.
7923                                                            - Must happen before
7924                                                              the following
7925                                                              atomicrmw.
7926                                                            - Ensures that all
7927                                                              memory operations
7928                                                              to global and L2 writeback
7929                                                              have completed before
7930                                                              performing the
7931                                                              atomicrmw that is
7932                                                              being released.
7933
7934                                                          3. flat_atomic
7935                                                          4. s_waitcnt vmcnt(0) &
7936                                                             lgkmcnt(0)
7937
7938                                                            - If TgSplit execution mode,
7939                                                              omit lgkmcnt(0).
7940                                                            - If OpenCL, omit
7941                                                              lgkmcnt(0).
7942                                                            - Must happen before
7943                                                              following buffer_invl2 and
7944                                                              buffer_wbinvl1_vol.
7945                                                            - Ensures the
7946                                                              atomicrmw has
7947                                                              completed before
7948                                                              invalidating the
7949                                                              caches.
7950
7951                                                          5. buffer_invl2;
7952                                                             buffer_wbinvl1_vol
7953
7954                                                            - Must happen before
7955                                                              any following
7956                                                              global/generic
7957                                                              load/load
7958                                                              atomic/atomicrmw.
7959                                                            - Ensures that
7960                                                              following
7961                                                              loads will not see
7962                                                              stale L1 global data,
7963                                                              nor see stale L2 MTYPE
7964                                                              NC global data.
7965                                                              MTYPE RW and CC memory will
7966                                                              never be stale in L2 due to
7967                                                              the memory probes.
7968
7969      fence        acq_rel      - singlethread *none*     *none*
7970                                - wavefront
7971      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkm/vmcnt(0)
7972
7973                                                            - Use lgkmcnt(0) if not
7974                                                              TgSplit execution mode
7975                                                              and vmcnt(0) if TgSplit
7976                                                              execution mode.
7977                                                            - If OpenCL and
7978                                                              address space is
7979                                                              not generic, omit
7980                                                              lgkmcnt(0).
7981                                                            - If OpenCL and
7982                                                              address space is
7983                                                              local, omit
7984                                                              vmcnt(0).
7985                                                            - However,
7986                                                              since LLVM
7987                                                              currently has no
7988                                                              address space on
7989                                                              the fence, the waits
7990                                                              must conservatively
7991                                                              always be
7992                                                              generated (see comment
7993                                                              for previous fence).
7994                                                            - s_waitcnt vmcnt(0)
7995                                                              must happen after
7996                                                              any preceding
7997                                                              global/generic
7998                                                              load/store/
7999                                                              load atomic/store atomic/
8000                                                              atomicrmw.
8001                                                            - s_waitcnt lgkmcnt(0)
8002                                                              must happen after
8003                                                              any preceding
8004                                                              local/generic
8005                                                              load/load
8006                                                              atomic/store/store
8007                                                              atomic/atomicrmw.
8008                                                            - Must happen before
8009                                                              any following
8010                                                              global/generic
8011                                                              load/load
8012                                                              atomic/store/store
8013                                                              atomic/atomicrmw.
8014                                                            - Ensures that all
8015                                                              memory operations
8016                                                              have
8017                                                              completed before
8018                                                              performing any
8019                                                              following global
8020                                                              memory operations.
8021                                                            - Ensures that the
8022                                                              preceding
8023                                                              local/generic load
8024                                                              atomic/atomicrmw
8025                                                              with an equal or
8026                                                              wider sync scope
8027                                                              and memory ordering
8028                                                              stronger than
8029                                                              unordered (this is
8030                                                              termed the
8031                                                              acquire-fence-paired-atomic)
8032                                                              has completed
8033                                                              before following
8034                                                              global memory
8035                                                              operations. This
8036                                                              satisfies the
8037                                                              requirements of
8038                                                              acquire.
8039                                                            - Ensures that all
8040                                                              previous memory
8041                                                              operations have
8042                                                              completed before a
8043                                                              following
8044                                                              local/generic store
8045                                                              atomic/atomicrmw
8046                                                              with an equal or
8047                                                              wider sync scope
8048                                                              and memory ordering
8049                                                              stronger than
8050                                                              unordered (this is
8051                                                              termed the
8052                                                              release-fence-paired-atomic).
8053                                                              This satisfies the
8054                                                              requirements of
8055                                                              release.
8056                                                            - Must happen before
8057                                                              the following
8058                                                              buffer_wbinvl1_vol.
8059                                                            - Ensures that the
8060                                                              acquire-fence-paired-atomic
8061                                                              has completed
8062                                                              before invalidating
8063                                                              the
8064                                                              cache. Therefore
8065                                                              any following
8066                                                              locations read must
8067                                                              be no older than
8068                                                              the value read by
8069                                                              the
8070                                                              acquire-fence-paired-atomic.
8071
8072                                                          2. buffer_wbinvl1_vol
8073
8074                                                            - If not TgSplit execution
8075                                                              mode, omit.
8076                                                            - Ensures that
8077                                                              following
8078                                                              loads will not see
8079                                                              stale data.
8080
8081      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
8082                                                             vmcnt(0)
8083
8084                                                            - If TgSplit execution mode,
8085                                                              omit lgkmcnt(0).
8086                                                            - If OpenCL and
8087                                                              address space is
8088                                                              not generic, omit
8089                                                              lgkmcnt(0).
8090                                                            - However, since LLVM
8091                                                              currently has no
8092                                                              address space on
8093                                                              the fence, the waits
8094                                                              must conservatively
8095                                                              always be generated
8096                                                              (see comment for
8097                                                              previous fence).
8098                                                            - Could be split into
8099                                                              separate s_waitcnt
8100                                                              vmcnt(0) and
8101                                                              s_waitcnt
8102                                                              lgkmcnt(0) to allow
8103                                                              them to be
8104                                                              independently moved
8105                                                              according to the
8106                                                              following rules.
8107                                                            - s_waitcnt vmcnt(0)
8108                                                              must happen after
8109                                                              any preceding
8110                                                              global/generic
8111                                                              load/store/load
8112                                                              atomic/store
8113                                                              atomic/atomicrmw.
8114                                                            - s_waitcnt lgkmcnt(0)
8115                                                              must happen after
8116                                                              any preceding
8117                                                              local/generic
8118                                                              load/store/load
8119                                                              atomic/store
8120                                                              atomic/atomicrmw.
8121                                                            - Must happen before
8122                                                              the following
8123                                                              buffer_wbinvl1_vol.
8124                                                            - Ensures that the
8125                                                              preceding
8126                                                              global/local/generic
8127                                                              load
8128                                                              atomic/atomicrmw
8129                                                              with an equal or
8130                                                              wider sync scope
8131                                                              and memory ordering
8132                                                              stronger than
8133                                                              unordered (this is
8134                                                              termed the
8135                                                              acquire-fence-paired-atomic)
8136                                                              has completed
8137                                                              before invalidating
8138                                                              the cache. This
8139                                                              satisfies the
8140                                                              requirements of
8141                                                              acquire.
8142                                                            - Ensures that all
8143                                                              previous memory
8144                                                              operations have
8145                                                              completed before a
8146                                                              following
8147                                                              global/local/generic
8148                                                              store
8149                                                              atomic/atomicrmw
8150                                                              with an equal or
8151                                                              wider sync scope
8152                                                              and memory ordering
8153                                                              stronger than
8154                                                              unordered (this is
8155                                                              termed the
8156                                                              release-fence-paired-atomic).
8157                                                              This satisfies the
8158                                                              requirements of
8159                                                              release.
8160
8161                                                          2. buffer_wbinvl1_vol
8162
8163                                                            - Must happen before
8164                                                              any following
8165                                                              global/generic
8166                                                              load/load
8167                                                              atomic/store/store
8168                                                              atomic/atomicrmw.
8169                                                            - Ensures that
8170                                                              following loads
8171                                                              will not see stale
8172                                                              global data. This
8173                                                              satisfies the
8174                                                              requirements of
8175                                                              acquire.
8176
8177      fence        acq_rel      - system       *none*     1. buffer_wbl2
8178
8179                                                            - If OpenCL and
8180                                                              address space is
8181                                                              local, omit.
8182                                                            - Must happen before
8183                                                              following s_waitcnt.
8184                                                            - Performs L2 writeback to
8185                                                              ensure previous
8186                                                              global/generic
8187                                                              store/atomicrmw are
8188                                                              visible at system scope.
8189
8190                                                          2. s_waitcnt lgkmcnt(0) &
8191                                                             vmcnt(0)
8192
8193                                                            - If TgSplit execution mode,
8194                                                              omit lgkmcnt(0).
8195                                                            - If OpenCL and
8196                                                              address space is
8197                                                              not generic, omit
8198                                                              lgkmcnt(0).
8199                                                            - However, since LLVM
8200                                                              currently has no
8201                                                              address space on
8202                                                              the fence, lgkmcnt(0)
8203                                                              must conservatively
8204                                                              always be generated
8205                                                              (see comment for
8206                                                              previous fence).
8207                                                            - Could be split into
8208                                                              separate s_waitcnt
8209                                                              vmcnt(0) and
8210                                                              s_waitcnt
8211                                                              lgkmcnt(0) to allow
8212                                                              them to be
8213                                                              independently moved
8214                                                              according to the
8215                                                              following rules.
8216                                                            - s_waitcnt vmcnt(0)
8217                                                              must happen after
8218                                                              any preceding
8219                                                              global/generic
8220                                                              load/store/load
8221                                                              atomic/store
8222                                                              atomic/atomicrmw.
8223                                                            - s_waitcnt lgkmcnt(0)
8224                                                              must happen after
8225                                                              any preceding
8226                                                              local/generic
8227                                                              load/store/load
8228                                                              atomic/store
8229                                                              atomic/atomicrmw.
8230                                                            - Must happen before
8231                                                              the following buffer_invl2 and
8232                                                              buffer_wbinvl1_vol.
8233                                                            - Ensures that the
8234                                                              preceding
8235                                                              global/local/generic
8236                                                              load
8237                                                              atomic/atomicrmw
8238                                                              with an equal or
8239                                                              wider sync scope
8240                                                              and memory ordering
8241                                                              stronger than
8242                                                              unordered (this is
8243                                                              termed the
8244                                                              acquire-fence-paired-atomic)
8245                                                              has completed
8246                                                              before invalidating
8247                                                              the cache. This
8248                                                              satisfies the
8249                                                              requirements of
8250                                                              acquire.
8251                                                            - Ensures that all
8252                                                              previous memory
8253                                                              operations have
8254                                                              completed before a
8255                                                              following
8256                                                              global/local/generic
8257                                                              store
8258                                                              atomic/atomicrmw
8259                                                              with an equal or
8260                                                              wider sync scope
8261                                                              and memory ordering
8262                                                              stronger than
8263                                                              unordered (this is
8264                                                              termed the
8265                                                              release-fence-paired-atomic).
8266                                                              This satisfies the
8267                                                              requirements of
8268                                                              release.
8269
8270                                                          3.  buffer_invl2;
8271                                                              buffer_wbinvl1_vol
8272
8273                                                            - Must happen before
8274                                                              any following
8275                                                              global/generic
8276                                                              load/load
8277                                                              atomic/store/store
8278                                                              atomic/atomicrmw.
8279                                                            - Ensures that
8280                                                              following
8281                                                              loads will not see
8282                                                              stale L1 global data,
8283                                                              nor see stale L2 MTYPE
8284                                                              NC global data.
8285                                                              MTYPE RW and CC memory will
8286                                                              never be stale in L2 due to
8287                                                              the memory probes.
8288
8289      **Sequentially Consistent Atomic**
8290      ------------------------------------------------------------------------------------
8291      load atomic  seq_cst      - singlethread - global   *Same as corresponding
8292                                - wavefront    - local    load atomic acquire,
8293                                               - generic  except must generate
8294                                                          all instructions even
8295                                                          for OpenCL.*
8296      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkm/vmcnt(0)
8297                                               - generic
8298                                                            - Use lgkmcnt(0) if not
8299                                                              TgSplit execution mode
8300                                                              and vmcnt(0) if TgSplit
8301                                                              execution mode.
8302                                                            - s_waitcnt lgkmcnt(0) must
8303                                                              happen after
8304                                                              preceding
8305                                                              local/generic load
8306                                                              atomic/store
8307                                                              atomic/atomicrmw
8308                                                              with memory
8309                                                              ordering of seq_cst
8310                                                              and with equal or
8311                                                              wider sync scope.
8312                                                              (Note that seq_cst
8313                                                              fences have their
8314                                                              own s_waitcnt
8315                                                              lgkmcnt(0) and so do
8316                                                              not need to be
8317                                                              considered.)
8318                                                            - s_waitcnt vmcnt(0)
8319                                                              must happen after
8320                                                              preceding
8321                                                              global/generic load
8322                                                              atomic/store
8323                                                              atomic/atomicrmw
8324                                                              with memory
8325                                                              ordering of seq_cst
8326                                                              and with equal or
8327                                                              wider sync scope.
8328                                                              (Note that seq_cst
8329                                                              fences have their
8330                                                              own s_waitcnt
8331                                                              vmcnt(0) and so do
8332                                                              not need to be
8333                                                              considered.)
8334                                                            - Ensures any
8335                                                              preceding
8336                                                              sequentially
8337                                                              consistent global/local
8338                                                              memory instructions
8339                                                              have completed
8340                                                              before executing
8341                                                              this sequentially
8342                                                              consistent
8343                                                              instruction. This
8344                                                              prevents reordering
8345                                                              a seq_cst store
8346                                                              followed by a
8347                                                              seq_cst load. (Note
8348                                                              that seq_cst is
8349                                                              stronger than
8350                                                              acquire/release as
8351                                                              the reordering of
8352                                                              load acquire
8353                                                              followed by a store
8354                                                              release is
8355                                                              prevented by the
8356                                                              s_waitcnt of
8357                                                              the release, but
8358                                                              there is nothing
8359                                                              preventing a store
8360                                                              release followed by
8361                                                              load acquire from
8362                                                              completing out of
8363                                                              order. The s_waitcnt
8364                                                              could be placed after
8365                                                              the seq_cst store or
8366                                                              before the seq_cst load. We
8367                                                              choose the load to
8368                                                              make the s_waitcnt be
8369                                                              as late as possible
8370                                                              so that the store
8371                                                              may have already
8372                                                              completed.)
8373
8374                                                          2. *Following
8375                                                             instructions same as
8376                                                             corresponding load
8377                                                             atomic acquire,
8378                                                             except must generate
8379                                                             all instructions even
8380                                                             for OpenCL.*
8381      load atomic  seq_cst      - workgroup    - local    *If TgSplit execution mode,
8382                                                          local address space cannot
8383                                                          be used.*
8384
8385                                                          *Same as corresponding
8386                                                          load atomic acquire,
8387                                                          except must generate
8388                                                          all instructions even
8389                                                          for OpenCL.*
8390
8391      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
8392                                - system       - generic     vmcnt(0)
8393
8394                                                            - If TgSplit execution mode,
8395                                                              omit lgkmcnt(0).
8396                                                            - Could be split into
8397                                                              separate s_waitcnt
8398                                                              vmcnt(0)
8399                                                              and s_waitcnt
8400                                                              lgkmcnt(0) to allow
8401                                                              them to be
8402                                                              independently moved
8403                                                              according to the
8404                                                              following rules.
8405                                                            - s_waitcnt lgkmcnt(0)
8406                                                              must happen after
8407                                                              preceding
8408                                                              local/generic load
8409                                                              atomic/store
8410                                                              atomic/atomicrmw
8411                                                              with memory
8412                                                              ordering of seq_cst
8413                                                              and with equal or
8414                                                              wider sync scope.
8415                                                              (Note that seq_cst
8416                                                              fences have their
8417                                                              own s_waitcnt
8418                                                              lgkmcnt(0) and so do
8419                                                              not need to be
8420                                                              considered.)
8421                                                            - s_waitcnt vmcnt(0)
8422                                                              must happen after
8423                                                              preceding
8424                                                              global/generic load
8425                                                              atomic/store
8426                                                              atomic/atomicrmw
8427                                                              with memory
8428                                                              ordering of seq_cst
8429                                                              and with equal or
8430                                                              wider sync scope.
8431                                                              (Note that seq_cst
8432                                                              fences have their
8433                                                              own s_waitcnt
8434                                                              vmcnt(0) and so do
8435                                                              not need to be
8436                                                              considered.)
8437                                                            - Ensures any
8438                                                              preceding
8439                                                              sequentially
8440                                                              consistent global
8441                                                              memory instructions
8442                                                              have completed
8443                                                              before executing
8444                                                              this sequentially
8445                                                              consistent
8446                                                              instruction. This
8447                                                              prevents reordering
8448                                                              a seq_cst store
8449                                                              followed by a
8450                                                              seq_cst load. (Note
8451                                                              that seq_cst is
8452                                                              stronger than
8453                                                              acquire/release as
8454                                                              the reordering of
8455                                                              load acquire
8456                                                              followed by a store
8457                                                              release is
8458                                                              prevented by the
8459                                                              s_waitcnt of
8460                                                              the release, but
8461                                                              there is nothing
8462                                                              preventing a store
8463                                                              release followed by
8464                                                              load acquire from
8465                                                              completing out of
8466                                                              order. The s_waitcnt
8467                                                              could be placed after
8468                                                              the seq_cst store or
8469                                                              before the seq_cst load. We
8470                                                              choose the load to
8471                                                              make the s_waitcnt be
8472                                                              as late as possible
8473                                                              so that the store
8474                                                              may have already
8475                                                              completed.)
8476
8477                                                          2. *Following
8478                                                             instructions same as
8479                                                             corresponding load
8480                                                             atomic acquire,
8481                                                             except must generate
8482                                                             all instructions even
8483                                                             for OpenCL.*
8484      store atomic seq_cst      - singlethread - global   *Same as corresponding
8485                                - wavefront    - local    store atomic release,
8486                                - workgroup    - generic  except must generate
8487                                - agent                   all instructions even
8488                                - system                  for OpenCL.*
8489      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
8490                                - wavefront    - local    atomicrmw acq_rel,
8491                                - workgroup    - generic  except must generate
8492                                - agent                   all instructions even
8493                                - system                  for OpenCL.*
8494      fence        seq_cst      - singlethread *none*     *Same as corresponding
8495                                - wavefront               fence acq_rel,
8496                                - workgroup               except must generate
8497                                - agent                   all instructions even
8498                                - system                  for OpenCL.*
8499      ============ ============ ============== ========== ================================
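
The leading ``s_waitcnt`` that the table above lists for a ``load atomic
seq_cst`` from the global or generic address space can be summarized as a small
decision function. The following is a minimal C++ sketch for illustration only.
The names ``SyncScope``, ``kVmcnt``, ``kLgkmcnt`` and ``seqCstLoadLeadingWait``
are hypothetical and are not part of the AMDGPU backend; only the choice of
counters is taken from the table.

.. code-block:: c++

  // Hypothetical names, used only to illustrate the table rows above.
  enum class SyncScope { SingleThread, Wavefront, Workgroup, Agent, System };

  // Bit flags naming the counters that the leading s_waitcnt waits on.
  constexpr unsigned kNone = 0;
  constexpr unsigned kVmcnt = 1u << 0;   // s_waitcnt vmcnt(0)
  constexpr unsigned kLgkmcnt = 1u << 1; // s_waitcnt lgkmcnt(0)

  // Counters waited on before a seq_cst load atomic from the global or
  // generic address space. The instructions that follow are the same as for
  // the corresponding load atomic acquire and are not modeled here.
  constexpr unsigned seqCstLoadLeadingWait(SyncScope scope, bool tgSplit) {
    switch (scope) {
    case SyncScope::SingleThread:
    case SyncScope::Wavefront:
      return kNone; // No leading s_waitcnt is listed for these scopes.
    case SyncScope::Workgroup:
      // lgkmcnt(0) if not TgSplit execution mode, vmcnt(0) if TgSplit.
      return tgSplit ? kVmcnt : kLgkmcnt;
    case SyncScope::Agent:
    case SyncScope::System:
      // Both counters, except that TgSplit execution mode omits lgkmcnt(0).
      return tgSplit ? kVmcnt : (kVmcnt | kLgkmcnt);
    }
    return kNone; // Not reached for valid SyncScope values.
  }

  // Usage: a workgroup-scope load outside TgSplit mode waits on lgkmcnt only.
  static_assert(seqCstLoadLeadingWait(SyncScope::Workgroup, false) == kLgkmcnt,
                "workgroup scope, not TgSplit");

The flags are kept separate because the table notes that the combined
``s_waitcnt`` could be split into independent ``vmcnt(0)`` and ``lgkmcnt(0)``
waits.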
8500
8501 .. _amdgpu-amdhsa-memory-model-gfx10:
8502
8503 Memory Model GFX10
8504 ++++++++++++++++++
8505
8506 For GFX10:
8507
8508 * Each agent has multiple shader arrays (SA).
8509 * Each SA has multiple work-group processors (WGP).
8510 * Each WGP has multiple compute units (CU).
8511 * Each CU has multiple SIMDs that execute wavefronts.
8512 * The wavefronts for a single work-group are executed in the same
8513   WGP. In CU wavefront execution mode the wavefronts may be executed by
8514   different SIMDs in the same CU. In WGP wavefront execution mode the
8515   wavefronts may be executed by different SIMDs in different CUs in the same
8516   WGP.
8517 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
8518   executing on it.
8519 * All LDS operations of a WGP are performed as wavefront wide operations in a
8520   global order and involve no caching. Completion is reported to a wavefront in
8521   execution order.
8522 * The LDS memory has multiple request queues shared by the SIMDs of a
8523   WGP. Therefore, the LDS operations performed by different wavefronts of a
8524   work-group can be reordered relative to each other, which can result in
8525   reordering the visibility of vector memory operations with respect to LDS
8526   operations of other wavefronts in the same work-group. A ``s_waitcnt
8527   lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8528   vector memory operations between wavefronts of a work-group, but not between
8529   operations performed by the same wavefront.
8530 * The vector memory operations are performed as wavefront wide operations.
8531   Completion of load/store/sample operations is reported to a wavefront in
8532   execution order of other load/store/sample operations performed by that
8533   wavefront.
8534 * The vector memory operations access a vector L0 cache. There is a single L0
8535   cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
8536   special action is required for coherence between the lanes of a single
8537   wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
8538   wavefronts executing in the same work-group as they may be executing on SIMDs
8539   of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
8540   required for coherence between wavefronts executing in different work-groups
8541   as they may be executing on different WGPs.
8542 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
8543   on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
8544   operations are used in a restricted way so do not impact the memory model. See
8545   :ref:`amdgpu-amdhsa-memory-spaces`.
8546 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
8547   the same SA. Therefore, no special action is required for coherence between
8548   the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
8549   required for coherence between wavefronts executing in different work-groups
8550   as they may be executing on different SAs that access different L1s.
8551 * The L1 caches have independent quadrants to service disjoint ranges of virtual
8552   addresses.
8553 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
8554   vector and scalar memory operations performed by different wavefronts, whether
8555   executing in the same or different work-groups (which may be executing on
8556   different CUs accessing different L0s), can be reordered relative to each
8557   other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
8558   synchronization between vector memory operations of different wavefronts. It
8559   ensures a previous vector memory operation has completed before executing a
8560   subsequent vector memory or LDS operation and so can be used to meet the
8561   requirements of acquire, release and sequential consistency.
8562 * The L1 caches use an L2 cache shared by all SAs on the same agent.
8563 * The L2 cache has independent channels to service disjoint ranges of virtual
8564   addresses.
8565 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
8566   quadrant has a separate request queue per L2 channel. Therefore, the vector
8567   and scalar memory operations performed by wavefronts executing in different
8568   work-groups (which may be executing on different SAs) of an agent can be
8569   reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
8570   required to ensure synchronization between vector memory operations of
8571   different SAs. It ensures a previous vector memory operation has completed
8572   before executing a subsequent vector memory operation and so can be used to
8573   meet the requirements of acquire, release and sequential consistency.
8574 * The L2 cache can be kept coherent with other agents on some targets, or ranges
8575   of virtual addresses can be set up to bypass it to ensure system coherence.
8576 * On GFX10.3 a memory attached last level (MALL) cache exists for GPU memory.
8577   The MALL cache is fully coherent with GPU memory and has no impact on system
8578   coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
8579
8580 Scalar memory operations are only used to access memory that is proven to not
8581 change during the execution of the kernel dispatch. This includes constant
8582 address space and global address space for program scope ``const`` variables.
8583 Therefore, the kernel machine code does not have to maintain the scalar cache to
8584 ensure it is coherent with the vector caches. The scalar and vector caches are
8585 invalidated between kernel dispatches by CP since constant address space data
8586 may change between kernel dispatch executions. See
8587 :ref:`amdgpu-amdhsa-memory-spaces`.
8588
8589 The one exception is if scalar writes are used to spill SGPR registers. In this
8590 case the AMDGPU backend ensures the memory location used to spill is never
8591 accessed by vector memory operations at the same time. If scalar writes are used
8592 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8593 return since the locations may be used for vector memory instructions by a
8594 future wavefront that uses the same scratch area, or a function call that
8595 creates a frame at the same address, respectively. There is no need for a
8596 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8597
8598 For kernarg backing memory:
8599
8600 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
8601 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
8602   needing to invalidate the L2 cache.
8603 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8604   so the L2 cache will be coherent with the CPU and other agents.
8605
8606 Scratch backing memory (which is used for the private address space) is accessed
8607 with MTYPE NC (non-coherent). Since the private address space is only accessed
8608 by a single thread, and is always write-before-read, there is never a need to
8609 invalidate these entries from the L0 or L1 caches.
8610
8611 Wavefronts are executed in native mode with in-order reporting of loads and
8612 sample instructions. In this mode vmcnt reports completion of load, atomic with
8613 return and sample instructions in order, and vscnt reports completion of
8614 store and atomic without return instructions in order. See the ``MEM_ORDERED`` field in
8615 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
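
As an illustrative sketch (not compiler output), the two counters can be waited
on independently. The register numbers below are arbitrary, and
``s_waitcnt_vscnt null, 0x0`` is the assembler spelling of the wait that this
section writes as ``s_waitcnt vscnt(0)``:

.. code-block:: nasm

  global_load_dword v0, v[2:3], off     ; load: completion counted by vmcnt
  global_store_dword v[4:5], v1, off    ; store: completion counted by vscnt
  s_waitcnt vmcnt(0)                    ; waits for the load only
  s_waitcnt_vscnt null, 0x0             ; waits for the store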
8616
8617 Wavefronts can be executed in WGP or CU wavefront execution mode:
8618
8619 * In WGP wavefront execution mode the wavefronts of a work-group are executed
8620   on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
8621   CU L0 caches is required for work-group synchronization. Also, accesses to L1
8622   at work-group scope need to be explicitly ordered as the accesses from
8623   different CUs are not ordered.
8624 * In CU wavefront execution mode the wavefronts of a work-group are executed on
8625   the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
8626   the work-group go through the same L0, which in turn ensures L1 accesses are
8627   ordered, and so explicit management of the caches is not required for
8628   work-group synchronization.
8629
8630 See the ``WGP_MODE`` field in
8631 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
8632 :ref:`amdgpu-target-features`.
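
For example, a work-group scope acquire load from the global address space
differs between the two modes as follows. This is a sketch based on the GFX10
code sequences table in this section; the register numbers are arbitrary:

.. code-block:: nasm

  ; WGP wavefront execution mode: the per-CU L0 must be managed explicitly.
  global_load_dword v0, v[2:3], off glc  ; glc=1
  s_waitcnt vmcnt(0)                     ; load completes before the invalidate
  buffer_gl0_inv                         ; following loads do not see stale L0 data

  ; CU wavefront execution mode: glc, the wait and the invalidate are omitted.
  global_load_dword v0, v[2:3], off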
8633
8634 The code sequences used to implement the memory model for GFX10 are defined in
8635 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.
8636
8637   .. table:: AMDHSA Memory Model Code Sequences GFX10
8638      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table
8639
8640      ============ ============ ============== ========== ================================
8641      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
8642                   Ordering     Sync Scope     Address    GFX10
8643                                               Space
8644      ============ ============ ============== ========== ================================
8645      **Non-Atomic**
8646      ------------------------------------------------------------------------------------
8647      load         *none*       *none*         - global   - !volatile & !nontemporal
8648                                               - generic
8649                                               - private    1. buffer/global/flat_load
8650                                               - constant
8651                                                          - !volatile & nontemporal
8652
8653                                                            1. buffer/global/flat_load
8654                                                               slc=1
8655
8656                                                          - volatile
8657
8658                                                            1. buffer/global/flat_load
8659                                                               glc=1 dlc=1
8660                                                            2. s_waitcnt vmcnt(0)
8661
8662                                                             - Must happen before
8663                                                               any following volatile
8664                                                               global/generic
8665                                                               load/store.
8666                                                             - Ensures that
8667                                                               volatile
8668                                                               operations to
8669                                                               different
8670                                                               addresses will not
8671                                                               be reordered by
8672                                                               hardware.
8673
8674      load         *none*       *none*         - local    1. ds_load
8675      store        *none*       *none*         - global   - !volatile & !nontemporal
8676                                               - generic
8677                                               - private    1. buffer/global/flat_store
8678                                               - constant
8679                                                          - !volatile & nontemporal
8680
8681                                                            1. buffer/global/flat_store
8682                                                               glc=1 slc=1
8683
8684                                                          - volatile
8685
8686                                                            1. buffer/global/flat_store
8687                                                            2. s_waitcnt vscnt(0)
8688
8689                                                             - Must happen before
8690                                                               any following volatile
8691                                                               global/generic
8692                                                               load/store.
8693                                                             - Ensures that
8694                                                               volatile
8695                                                               operations to
8696                                                               different
8697                                                               addresses will not
8698                                                               be reordered by
8699                                                               hardware.
8700
8701      store        *none*       *none*         - local    1. ds_store
8702      **Unordered Atomic**
8703      ------------------------------------------------------------------------------------
8704      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
8705      store atomic unordered    *any*          *any*      *Same as non-atomic*.
8706      atomicrmw    unordered    *any*          *any*      *Same as monotonic atomic*.
8707      **Monotonic Atomic**
8708      ------------------------------------------------------------------------------------
8709      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
8710                                - wavefront    - generic
8711      load atomic  monotonic    - workgroup    - global   1. buffer/global/flat_load
8712                                               - generic     glc=1
8713
8714                                                            - If CU wavefront execution
8715                                                              mode, omit glc=1.
8716
8717      load atomic  monotonic    - singlethread - local    1. ds_load
8718                                - wavefront
8719                                - workgroup
8720      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
8721                                - system       - generic     glc=1 dlc=1
8722      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
8723                                - wavefront    - generic
8724                                - workgroup
8725                                - agent
8726                                - system
8727      store atomic monotonic    - singlethread - local    1. ds_store
8728                                - wavefront
8729                                - workgroup
8730      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
8731                                - wavefront    - generic
8732                                - workgroup
8733                                - agent
8734                                - system
8735      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
8736                                - wavefront
8737                                - workgroup
8738      **Acquire Atomic**
8739      ------------------------------------------------------------------------------------
8740      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
8741                                - wavefront    - local
8742                                               - generic
8743      load atomic  acquire      - workgroup    - global   1. buffer/global_load glc=1
8744
8745                                                            - If CU wavefront execution
8746                                                              mode, omit glc=1.
8747
8748                                                          2. s_waitcnt vmcnt(0)
8749
8750                                                            - If CU wavefront execution
8751                                                              mode, omit.
8752                                                            - Must happen before
8753                                                              the following buffer_gl0_inv
8754                                                              and before any following
8755                                                              global/generic
8756                                                              load/load
8757                                                              atomic/store/store
8758                                                              atomic/atomicrmw.
8759
8760                                                          3. buffer_gl0_inv
8761
8762                                                            - If CU wavefront execution
8763                                                              mode, omit.
8764                                                            - Ensures that
8765                                                              following
8766                                                              loads will not see
8767                                                              stale data.
8768
8769      load atomic  acquire      - workgroup    - local    1. ds_load
8770                                                          2. s_waitcnt lgkmcnt(0)
8771
8772                                                            - If OpenCL, omit.
8773                                                            - Must happen before
8774                                                              the following buffer_gl0_inv
8775                                                              and before any following
8776                                                              global/generic load/load
8777                                                              atomic/store/store
8778                                                              atomic/atomicrmw.
8779                                                            - Ensures any
8780                                                              following global
8781                                                              data read is no
8782                                                              older than the local load
8783                                                              atomic value being
8784                                                              acquired.
8785
8786                                                          3. buffer_gl0_inv
8787
8788                                                            - If CU wavefront execution
8789                                                              mode, omit.
8790                                                            - If OpenCL, omit.
8791                                                            - Ensures that
8792                                                              following
8793                                                              loads will not see
8794                                                              stale data.
8795
8796      load atomic  acquire      - workgroup    - generic  1. flat_load glc=1
8797
8798                                                            - If CU wavefront execution
8799                                                              mode, omit glc=1.
8800
8801                                                          2. s_waitcnt lgkmcnt(0) &
8802                                                             vmcnt(0)
8803
8804                                                            - If CU wavefront execution
8805                                                              mode, omit vmcnt(0).
8806                                                            - If OpenCL, omit
8807                                                              lgkmcnt(0).
8808                                                            - Must happen before
8809                                                              the following
8810                                                              buffer_gl0_inv and any
8811                                                              following global/generic
8812                                                              load/load
8813                                                              atomic/store/store
8814                                                              atomic/atomicrmw.
8815                                                            - Ensures any
8816                                                              following global
8817                                                              data read is no
8818                                                              older than a local load
8819                                                              atomic value being
8820                                                              acquired.
8821
8822                                                          3. buffer_gl0_inv
8823
8824                                                            - If CU wavefront execution
8825                                                              mode, omit.
8826                                                            - Ensures that
8827                                                              following
8828                                                              loads will not see
8829                                                              stale data.
8830
8831      load atomic  acquire      - agent        - global   1. buffer/global_load
8832                                - system                     glc=1 dlc=1
8833                                                          2. s_waitcnt vmcnt(0)
8834
8835                                                            - Must happen before
8836                                                              following
8837                                                              buffer_gl*_inv.
8838                                                            - Ensures the load
8839                                                              has completed
8840                                                              before invalidating
8841                                                              the caches.
8842
8843                                                          3. buffer_gl0_inv;
8844                                                             buffer_gl1_inv
8845
8846                                                            - Must happen before
8847                                                              any following
8848                                                              global/generic
8849                                                              load/load
8850                                                              atomic/atomicrmw.
8851                                                            - Ensures that
8852                                                              following
8853                                                              loads will not see
8854                                                              stale global data.
8855
8856      load atomic  acquire      - agent        - generic  1. flat_load glc=1 dlc=1
8857                                - system                  2. s_waitcnt vmcnt(0) &
8858                                                             lgkmcnt(0)
8859
8860                                                            - If OpenCL, omit
8861                                                              lgkmcnt(0).
8862                                                            - Must happen before
8863                                                              following
8864                                                              buffer_gl*_inv.
8865                                                            - Ensures the flat_load
8866                                                              has completed
8867                                                              before invalidating
8868                                                              the caches.
8869
8870                                                          3. buffer_gl0_inv;
8871                                                             buffer_gl1_inv
8872
8873                                                            - Must happen before
8874                                                              any following
8875                                                              global/generic
8876                                                              load/load
8877                                                              atomic/atomicrmw.
8878                                                            - Ensures that
8879                                                              following loads
8880                                                              will not see stale
8881                                                              global data.
8882
8883      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
8884                                - wavefront    - local
8885                                               - generic
8886      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
8887                                                          2. s_waitcnt vm/vscnt(0)
8888
8889                                                            - If CU wavefront execution
8890                                                              mode, omit.
8891                                                            - Use vmcnt(0) if atomic with
8892                                                              return and vscnt(0) if
8893                                                              atomic with no-return.
8894                                                            - Must happen before
8895                                                              the following buffer_gl0_inv
8896                                                              and before any following
8897                                                              global/generic
8898                                                              load/load
8899                                                              atomic/store/store
8900                                                              atomic/atomicrmw.
8901
8902                                                          3. buffer_gl0_inv
8903
8904                                                            - If CU wavefront execution
8905                                                              mode, omit.
8906                                                            - Ensures that
8907                                                              following
8908                                                              loads will not see
8909                                                              stale data.
8910
8911      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
8912                                                          2. s_waitcnt lgkmcnt(0)
8913
8914                                                            - If OpenCL, omit.
8915                                                            - Must happen before
8916                                                              the following
8917                                                              buffer_gl0_inv.
8918                                                            - Ensures any
8919                                                              following global
8920                                                              data read is no
8921                                                              older than the local
8922                                                              atomicrmw value
8923                                                              being acquired.
8924
8925                                                          3. buffer_gl0_inv
8926
8927                                                            - If OpenCL, omit.
8928                                                            - Ensures that
8929                                                              following
8930                                                              loads will not see
8931                                                              stale data.
8932
8933      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
8934                                                          2. s_waitcnt lgkmcnt(0) &
8935                                                             vm/vscnt(0)
8936
8937                                                            - If CU wavefront execution
8938                                                              mode, omit vm/vscnt(0).
8939                                                            - If OpenCL, omit lgkmcnt(0).
8940                                                            - Use vmcnt(0) if atomic with
8941                                                              return and vscnt(0) if
8942                                                              atomic with no-return.
8943                                                            - Must happen before
8944                                                              the following
8945                                                              buffer_gl0_inv.
8946                                                            - Ensures any
8947                                                              following global
8948                                                              data read is no
8949                                                              older than a local
8950                                                              atomicrmw value
8951                                                              being acquired.
8952
8953                                                          3. buffer_gl0_inv
8954
8955                                                            - If CU wavefront execution
8956                                                              mode, omit.
8957                                                            - Ensures that
8958                                                              following
8959                                                              loads will not see
8960                                                              stale data.
8961
8962      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
8963                                - system                  2. s_waitcnt vm/vscnt(0)
8964
8965                                                            - Use vmcnt(0) if atomic with
8966                                                              return and vscnt(0) if
8967                                                              atomic with no-return.
8968                                                            - Must happen before
8969                                                              following
8970                                                              buffer_gl*_inv.
8971                                                            - Ensures the
8972                                                              atomicrmw has
8973                                                              completed before
8974                                                              invalidating the
8975                                                              caches.
8976
8977                                                          3. buffer_gl0_inv;
8978                                                             buffer_gl1_inv
8979
8980                                                            - Must happen before
8981                                                              any following
8982                                                              global/generic
8983                                                              load/load
8984                                                              atomic/atomicrmw.
8985                                                            - Ensures that
8986                                                              following loads
8987                                                              will not see stale
8988                                                              global data.
8989
8990      atomicrmw    acquire      - agent        - generic  1. flat_atomic
8991                                - system                  2. s_waitcnt vm/vscnt(0) &
8992                                                             lgkmcnt(0)
8993
8994                                                            - If OpenCL, omit
8995                                                              lgkmcnt(0).
8996                                                            - Use vmcnt(0) if atomic with
8997                                                              return and vscnt(0) if
8998                                                              atomic with no-return.
8999                                                            - Must happen before
9000                                                              following
9001                                                              buffer_gl*_inv.
9002                                                            - Ensures the
9003                                                              atomicrmw has
9004                                                              completed before
9005                                                              invalidating the
9006                                                              caches.
9007
9008                                                          3. buffer_gl0_inv;
9009                                                             buffer_gl1_inv
9010
9011                                                            - Must happen before
9012                                                              any following
9013                                                              global/generic
9014                                                              load/load
9015                                                              atomic/atomicrmw.
9016                                                            - Ensures that
9017                                                              following loads
9018                                                              will not see stale
9019                                                              global data.
9020
9021      fence        acquire      - singlethread *none*     *none*
9022                                - wavefront
9023      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
9024                                                             vmcnt(0) & vscnt(0)
9025
9026                                                            - If CU wavefront execution
9027                                                              mode, omit vmcnt(0) and
9028                                                              vscnt(0).
9029                                                            - If OpenCL and
9030                                                              address space is
9031                                                              not generic, omit
9032                                                              lgkmcnt(0).
9033                                                            - If OpenCL and
9034                                                              address space is
9035                                                              local, omit
9036                                                              vmcnt(0) and vscnt(0).
9037                                                            - However, since LLVM
9038                                                              currently has no
9039                                                              address space on
9040                                                              the fence, the waits
9041                                                              must conservatively
9042                                                              always be generated. If
9043                                                              fence had an
9044                                                              address space then
9045                                                              set to address
9046                                                              space of OpenCL
9047                                                              fence flag, or to
9048                                                              generic if both
9049                                                              local and global
9050                                                              flags are
9051                                                              specified.
9052                                                            - Could be split into
9053                                                              separate s_waitcnt
9054                                                              vmcnt(0), s_waitcnt
9055                                                              vscnt(0) and s_waitcnt
9056                                                              lgkmcnt(0) to allow
9057                                                              them to be
9058                                                              independently moved
9059                                                              according to the
9060                                                              following rules.
9061                                                            - s_waitcnt vmcnt(0)
9062                                                              must happen after
9063                                                              any preceding
9064                                                              global/generic load
9065                                                              atomic/
9066                                                              atomicrmw-with-return-value
9067                                                              with an equal or
9068                                                              wider sync scope
9069                                                              and memory ordering
9070                                                              stronger than
9071                                                              unordered (this is
9072                                                              termed the
9073                                                              fence-paired-atomic).
9074                                                            - s_waitcnt vscnt(0)
9075                                                              must happen after
9076                                                              any preceding
9077                                                              global/generic
9078                                                              atomicrmw-no-return-value
9079                                                              with an equal or
9080                                                              wider sync scope
9081                                                              and memory ordering
9082                                                              stronger than
9083                                                              unordered (this is
9084                                                              termed the
9085                                                              fence-paired-atomic).
9086                                                            - s_waitcnt lgkmcnt(0)
9087                                                              must happen after
9088                                                              any preceding
9089                                                              local/generic load
9090                                                              atomic/atomicrmw
9091                                                              with an equal or
9092                                                              wider sync scope
9093                                                              and memory ordering
9094                                                              stronger than
9095                                                              unordered (this is
9096                                                              termed the
9097                                                              fence-paired-atomic).
9098                                                            - Must happen before
9099                                                              the following
9100                                                              buffer_gl0_inv.
9101                                                            - Ensures that the
9102                                                              fence-paired atomic
9103                                                              has completed
9104                                                              before invalidating
9105                                                              the
9106                                                              cache. Therefore
9107                                                              any following
9108                                                              locations read must
9109                                                              be no older than
9110                                                              the value read by
9111                                                              the
9112                                                              fence-paired-atomic.
9113
9114                                                          3. buffer_gl0_inv
9115
9116                                                            - If CU wavefront execution
9117                                                              mode, omit.
9118                                                            - Ensures that
9119                                                              following
9120                                                              loads will not see
9121                                                              stale data.
9122
9123      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9124                                - system                     vmcnt(0) & vscnt(0)
9125
9126                                                            - If OpenCL and
9127                                                              address space is
9128                                                              not generic, omit
9129                                                              lgkmcnt(0).
9130                                                            - If OpenCL and
9131                                                              address space is
9132                                                              local, omit
9133                                                              vmcnt(0) and vscnt(0).
9134                                                            - However, since LLVM
9135                                                              currently has no
9136                                                              address space on
9137                                                              the fence, the waits
9138                                                              must conservatively
9139                                                              always be generated
9140                                                              (see comment for
9141                                                              previous fence).
9142                                                            - Could be split into
9143                                                              separate s_waitcnt
9144                                                              vmcnt(0), s_waitcnt
9145                                                              vscnt(0) and s_waitcnt
9146                                                              lgkmcnt(0) to allow
9147                                                              them to be
9148                                                              independently moved
9149                                                              according to the
9150                                                              following rules.
9151                                                            - s_waitcnt vmcnt(0)
9152                                                              must happen after
9153                                                              any preceding
9154                                                              global/generic load
9155                                                              atomic/
9156                                                              atomicrmw-with-return-value
9157                                                              with an equal or
9158                                                              wider sync scope
9159                                                              and memory ordering
9160                                                              stronger than
9161                                                              unordered (this is
9162                                                              termed the
9163                                                              fence-paired-atomic).
9164                                                            - s_waitcnt vscnt(0)
9165                                                              must happen after
9166                                                              any preceding
9167                                                              global/generic
9168                                                              atomicrmw-no-return-value
9169                                                              with an equal or
9170                                                              wider sync scope
9171                                                              and memory ordering
9172                                                              stronger than
9173                                                              unordered (this is
9174                                                              termed the
9175                                                              fence-paired-atomic).
9176                                                            - s_waitcnt lgkmcnt(0)
9177                                                              must happen after
9178                                                              any preceding
9179                                                              local/generic load
9180                                                              atomic/atomicrmw
9181                                                              with an equal or
9182                                                              wider sync scope
9183                                                              and memory ordering
9184                                                              stronger than
9185                                                              unordered (this is
9186                                                              termed the
9187                                                              fence-paired-atomic).
9188                                                            - Must happen before
9189                                                              the following
9190                                                              buffer_gl*_inv.
9191                                                            - Ensures that the
9192                                                              fence-paired atomic
9193                                                              has completed
9194                                                              before invalidating
9195                                                              the
9196                                                              caches. Therefore
9197                                                              any following
9198                                                              locations read must
9199                                                              be no older than
9200                                                              the value read by
9201                                                              the
9202                                                              fence-paired-atomic.
9203
9204                                                          2. buffer_gl0_inv;
9205                                                             buffer_gl1_inv
9206
9207                                                            - Must happen before any
9208                                                              following global/generic
9209                                                              load/load
9210                                                              atomic/store/store
9211                                                              atomic/atomicrmw.
9212                                                            - Ensures that
9213                                                              following loads
9214                                                              will not see stale
9215                                                              global data.
9216
9217      **Release Atomic**
9218      ------------------------------------------------------------------------------------
9219      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
9220                                - wavefront    - local
9221                                               - generic
9222      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9223                                               - generic     vmcnt(0) & vscnt(0)
9224
9225                                                            - If CU wavefront execution
9226                                                              mode, omit vmcnt(0) and
9227                                                              vscnt(0).
9228                                                            - If OpenCL, omit
9229                                                              lgkmcnt(0).
9230                                                            - Could be split into
9231                                                              separate s_waitcnt
9232                                                              vmcnt(0), s_waitcnt
9233                                                              vscnt(0) and s_waitcnt
9234                                                              lgkmcnt(0) to allow
9235                                                              them to be
9236                                                              independently moved
9237                                                              according to the
9238                                                              following rules.
9239                                                            - s_waitcnt vmcnt(0)
9240                                                              must happen after
9241                                                              any preceding
9242                                                              global/generic load/load
9243                                                              atomic/
9244                                                              atomicrmw-with-return-value.
9245                                                            - s_waitcnt vscnt(0)
9246                                                              must happen after
9247                                                              any preceding
9248                                                              global/generic
9249                                                              store/store
9250                                                              atomic/
9251                                                              atomicrmw-no-return-value.
9252                                                            - s_waitcnt lgkmcnt(0)
9253                                                              must happen after
9254                                                              any preceding
9255                                                              local/generic
9256                                                              load/store/load
9257                                                              atomic/store
9258                                                              atomic/atomicrmw.
9259                                                            - Must happen before
9260                                                              the following
9261                                                              store.
9262                                                            - Ensures that all
9263                                                              memory operations
9264                                                              have
9265                                                              completed before
9266                                                              performing the
9267                                                              store that is being
9268                                                              released.
9269
9270                                                          2. buffer/global/flat_store
9271      store atomic release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9272
9273                                                            - If CU wavefront execution
9274                                                              mode, omit.
9275                                                            - If OpenCL, omit.
9276                                                            - Could be split into
9277                                                              separate s_waitcnt
9278                                                              vmcnt(0) and s_waitcnt
9279                                                              vscnt(0) to allow
9280                                                              them to be
9281                                                              independently moved
9282                                                              according to the
9283                                                              following rules.
9284                                                            - s_waitcnt vmcnt(0)
9285                                                              must happen after
9286                                                              any preceding
9287                                                              global/generic load/load
9288                                                              atomic/
9289                                                              atomicrmw-with-return-value.
9290                                                            - s_waitcnt vscnt(0)
9291                                                              must happen after
9292                                                              any preceding
9293                                                              global/generic
9294                                                              store/store atomic/
9295                                                              atomicrmw-no-return-value.
9296                                                            - Must happen before
9297                                                              the following
9298                                                              store.
9299                                                            - Ensures that all
9300                                                              global memory
9301                                                              operations have
9302                                                              completed before
9303                                                              performing the
9304                                                              store that is being
9305                                                              released.
9306
9307                                                          2. ds_store
9308      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9309                                - system       - generic     vmcnt(0) & vscnt(0)
9310
9311                                                            - If OpenCL and
9312                                                              address space is
9313                                                              not generic, omit
9314                                                              lgkmcnt(0).
9315                                                            - Could be split into
9316                                                              separate s_waitcnt
9317                                                              vmcnt(0), s_waitcnt vscnt(0)
9318                                                              and s_waitcnt
9319                                                              lgkmcnt(0) to allow
9320                                                              them to be
9321                                                              independently moved
9322                                                              according to the
9323                                                              following rules.
9324                                                            - s_waitcnt vmcnt(0)
9325                                                              must happen after
9326                                                              any preceding
9327                                                              global/generic
9328                                                              load/load
9329                                                              atomic/
9330                                                              atomicrmw-with-return-value.
9331                                                            - s_waitcnt vscnt(0)
9332                                                              must happen after
9333                                                              any preceding
9334                                                              global/generic
9335                                                              store/store atomic/
9336                                                              atomicrmw-no-return-value.
9337                                                            - s_waitcnt lgkmcnt(0)
9338                                                              must happen after
9339                                                              any preceding
9340                                                              local/generic
9341                                                              load/store/load
9342                                                              atomic/store
9343                                                              atomic/atomicrmw.
9344                                                            - Must happen before
9345                                                              the following
9346                                                              store.
9347                                                            - Ensures that all
9348                                                              memory operations
9349                                                              have
9350                                                              completed before
9351                                                              performing the
9352                                                              store that is being
9353                                                              released.
9354
9355                                                          2. buffer/global/flat_store
9356      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
9357                                - wavefront    - local
9358                                               - generic
9359      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9360                                               - generic     vmcnt(0) & vscnt(0)
9361
9362                                                            - If CU wavefront execution
9363                                                              mode, omit vmcnt(0) and
9364                                                              vscnt(0).
9365                                                            - If OpenCL, omit lgkmcnt(0).
9366                                                            - Could be split into
9367                                                              separate s_waitcnt
9368                                                              vmcnt(0), s_waitcnt
9369                                                              vscnt(0) and s_waitcnt
9370                                                              lgkmcnt(0) to allow
9371                                                              them to be
9372                                                              independently moved
9373                                                              according to the
9374                                                              following rules.
9375                                                            - s_waitcnt vmcnt(0)
9376                                                              must happen after
9377                                                              any preceding
9378                                                              global/generic load/load
9379                                                              atomic/
9380                                                              atomicrmw-with-return-value.
9381                                                            - s_waitcnt vscnt(0)
9382                                                              must happen after
9383                                                              any preceding
9384                                                              global/generic
9385                                                              store/store
9386                                                              atomic/
9387                                                              atomicrmw-no-return-value.
9388                                                            - s_waitcnt lgkmcnt(0)
9389                                                              must happen after
9390                                                              any preceding
9391                                                              local/generic
9392                                                              load/store/load
9393                                                              atomic/store
9394                                                              atomic/atomicrmw.
9395                                                            - Must happen before
9396                                                              the following
9397                                                              atomicrmw.
9398                                                            - Ensures that all
9399                                                              memory operations
9400                                                              have
9401                                                              completed before
9402                                                              performing the
9403                                                              atomicrmw that is
9404                                                              being released.
9405
9406                                                          2. buffer/global/flat_atomic
9407      atomicrmw    release      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9408
9409                                                            - If CU wavefront execution
9410                                                              mode, omit.
9411                                                            - If OpenCL, omit.
9412                                                            - Could be split into
9413                                                              separate s_waitcnt
9414                                                              vmcnt(0) and s_waitcnt
9415                                                              vscnt(0) to allow
9416                                                              them to be
9417                                                              independently moved
9418                                                              according to the
9419                                                              following rules.
9420                                                            - s_waitcnt vmcnt(0)
9421                                                              must happen after
9422                                                              any preceding
9423                                                              global/generic load/load
9424                                                              atomic/
9425                                                              atomicrmw-with-return-value.
9426                                                            - s_waitcnt vscnt(0)
9427                                                              must happen after
9428                                                              any preceding
9429                                                              global/generic
9430                                                              store/store atomic/
9431                                                              atomicrmw-no-return-value.
9432                                                            - Must happen before
9433                                                              the following
9434                                                              atomicrmw.
9435                                                            - Ensures that all
9436                                                              global memory
9437                                                              operations have
9438                                                              completed before
9439                                                              performing the
9440                                                              atomicrmw that is
9441                                                              being released.
9442
9443                                                          2. ds_atomic
9444      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9445                                - system       - generic     vmcnt(0) & vscnt(0)
9446
9447                                                            - If OpenCL, omit
9448                                                              lgkmcnt(0).
9449                                                            - Could be split into
9450                                                              separate s_waitcnt
9451                                                              vmcnt(0), s_waitcnt
9452                                                              vscnt(0) and s_waitcnt
9453                                                              lgkmcnt(0) to allow
9454                                                              them to be
9455                                                              independently moved
9456                                                              according to the
9457                                                              following rules.
9458                                                            - s_waitcnt vmcnt(0)
9459                                                              must happen after
9460                                                              any preceding
9461                                                              global/generic
9462                                                              load/load atomic/
9463                                                              atomicrmw-with-return-value.
9464                                                            - s_waitcnt vscnt(0)
9465                                                              must happen after
9466                                                              any preceding
9467                                                              global/generic
9468                                                              store/store atomic/
9469                                                              atomicrmw-no-return-value.
9470                                                            - s_waitcnt lgkmcnt(0)
9471                                                              must happen after
9472                                                              any preceding
9473                                                              local/generic
9474                                                              load/store/load
9475                                                              atomic/store
9476                                                              atomic/atomicrmw.
9477                                                            - Must happen before
9478                                                              the following
9479                                                              atomicrmw.
9480                                                            - Ensures that all
9481                                                              memory operations
9482                                                              to global and local
9483                                                              have completed
9484                                                              before performing
9485                                                              the atomicrmw that
9486                                                              is being released.
9487
9488                                                          2. buffer/global/flat_atomic
9489      fence        release      - singlethread *none*     *none*
9490                                - wavefront
9491      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
9492                                                             vmcnt(0) & vscnt(0)
9493
9494                                                            - If CU wavefront execution
9495                                                              mode, omit vmcnt(0) and
9496                                                              vscnt(0).
9497                                                            - If OpenCL and
9498                                                              address space is
9499                                                              not generic, omit
9500                                                              lgkmcnt(0).
9501                                                            - If OpenCL and
9502                                                              address space is
9503                                                              local, omit
9504                                                              vmcnt(0) and vscnt(0).
9505                                                            - However, since LLVM
9506                                                              currently has no
9507                                                              address space on
9508                                                              the fence, the waits
9509                                                              must conservatively
9510                                                              always be generated.
9511                                                              If the fence had an
9512                                                              address space, it
9513                                                              would be set to the
9514                                                              address space of the
9515                                                              OpenCL fence flag, or
9516                                                              to generic if both
9517                                                              local and global
9518                                                              flags are
9519                                                              specified.
9520                                                            - Could be split into
9521                                                              separate s_waitcnt
9522                                                              vmcnt(0), s_waitcnt
9523                                                              vscnt(0) and s_waitcnt
9524                                                              lgkmcnt(0) to allow
9525                                                              them to be
9526                                                              independently moved
9527                                                              according to the
9528                                                              following rules.
9529                                                            - s_waitcnt vmcnt(0)
9530                                                              must happen after
9531                                                              any preceding
9532                                                              global/generic
9533                                                              load/load
9534                                                              atomic/
9535                                                              atomicrmw-with-return-value.
9536                                                            - s_waitcnt vscnt(0)
9537                                                              must happen after
9538                                                              any preceding
9539                                                              global/generic
9540                                                              store/store atomic/
9541                                                              atomicrmw-no-return-value.
9542                                                            - s_waitcnt lgkmcnt(0)
9543                                                              must happen after
9544                                                              any preceding
9545                                                              local/generic
9546                                                              load/store/load
9547                                                              atomic/store atomic/
9548                                                              atomicrmw.
9549                                                            - Must happen before
9550                                                              any following store
9551                                                              atomic/atomicrmw
9552                                                              with an equal or
9553                                                              wider sync scope
9554                                                              and memory ordering
9555                                                              stronger than
9556                                                              unordered (this is
9557                                                              termed the
9558                                                              fence-paired-atomic).
9559                                                            - Ensures that all
9560                                                              memory operations
9561                                                              have
9562                                                              completed before
9563                                                              performing the
9564                                                              following
9565                                                              fence-paired-atomic.
9566
9567      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
9568                                - system                     vmcnt(0) & vscnt(0)
9569
9570                                                            - If OpenCL and
9571                                                              address space is
9572                                                              not generic, omit
9573                                                              lgkmcnt(0).
9574                                                            - If OpenCL and
9575                                                              address space is
9576                                                              local, omit
9577                                                              vmcnt(0) and vscnt(0).
9578                                                            - However, since LLVM
9579                                                              currently has no
9580                                                              address space on
9581                                                              the fence, the waits
9582                                                              must conservatively
9583                                                              always be generated.
9584                                                              If the fence had an
9585                                                              address space, it
9586                                                              would be set to the
9587                                                              address space of the
9588                                                              OpenCL fence flag, or
9589                                                              to generic if both
9590                                                              local and global
9591                                                              flags are
9592                                                              specified.
9593                                                            - Could be split into
9594                                                              separate s_waitcnt
9595                                                              vmcnt(0), s_waitcnt
9596                                                              vscnt(0) and s_waitcnt
9597                                                              lgkmcnt(0) to allow
9598                                                              them to be
9599                                                              independently moved
9600                                                              according to the
9601                                                              following rules.
9602                                                            - s_waitcnt vmcnt(0)
9603                                                              must happen after
9604                                                              any preceding
9605                                                              global/generic
9606                                                              load/load atomic/
9607                                                              atomicrmw-with-return-value.
9608                                                            - s_waitcnt vscnt(0)
9609                                                              must happen after
9610                                                              any preceding
9611                                                              global/generic
9612                                                              store/store atomic/
9613                                                              atomicrmw-no-return-value.
9614                                                            - s_waitcnt lgkmcnt(0)
9615                                                              must happen after
9616                                                              any preceding
9617                                                              local/generic
9618                                                              load/store/load
9619                                                              atomic/store
9620                                                              atomic/atomicrmw.
9621                                                            - Must happen before
9622                                                              any following store
9623                                                              atomic/atomicrmw
9624                                                              with an equal or
9625                                                              wider sync scope
9626                                                              and memory ordering
9627                                                              stronger than
9628                                                              unordered (this is
9629                                                              termed the
9630                                                              fence-paired-atomic).
9631                                                            - Ensures that all
9632                                                              memory operations
9633                                                              have
9634                                                              completed before
9635                                                              performing the
9636                                                              following
9637                                                              fence-paired-atomic.
9638
9639      **Acquire-Release Atomic**
9640      ------------------------------------------------------------------------------------
9641      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
9642                                - wavefront    - local
9643                                               - generic
9644      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
9645                                                             vmcnt(0) & vscnt(0)
9646
9647                                                            - If CU wavefront execution
9648                                                              mode, omit vmcnt(0) and
9649                                                              vscnt(0).
9650                                                            - If OpenCL, omit
9651                                                              lgkmcnt(0).
9652                                                            - Must happen after
9653                                                              any preceding
9654                                                              local/generic
9655                                                              load/store/load
9656                                                              atomic/store
9657                                                              atomic/atomicrmw.
9658                                                            - Could be split into
9659                                                              separate s_waitcnt
9660                                                              vmcnt(0), s_waitcnt
9661                                                              vscnt(0), and s_waitcnt
9662                                                              lgkmcnt(0) to allow
9663                                                              them to be
9664                                                              independently moved
9665                                                              according to the
9666                                                              following rules.
9667                                                            - s_waitcnt vmcnt(0)
9668                                                              must happen after
9669                                                              any preceding
9670                                                              global/generic load/load
9671                                                              atomic/
9672                                                              atomicrmw-with-return-value.
9673                                                            - s_waitcnt vscnt(0)
9674                                                              must happen after
9675                                                              any preceding
9676                                                              global/generic
9677                                                              store/store
9678                                                              atomic/
9679                                                              atomicrmw-no-return-value.
9680                                                            - s_waitcnt lgkmcnt(0)
9681                                                              must happen after
9682                                                              any preceding
9683                                                              local/generic
9684                                                              load/store/load
9685                                                              atomic/store
9686                                                              atomic/atomicrmw.
9687                                                            - Must happen before
9688                                                              the following
9689                                                              atomicrmw.
9690                                                            - Ensures that all
9691                                                              memory operations
9692                                                              have
9693                                                              completed before
9694                                                              performing the
9695                                                              atomicrmw that is
9696                                                              being released.
9697
9698                                                          2. buffer/global_atomic
9699                                                          3. s_waitcnt vm/vscnt(0)
9700
9701                                                            - If CU wavefront execution
9702                                                              mode, omit.
9703                                                            - Use vmcnt(0) if atomic with
9704                                                              return and vscnt(0) if
9705                                                              atomic with no-return.
9706                                                            - Must happen before
9707                                                              the following
9708                                                              buffer_gl0_inv.
9709                                                            - Ensures any
9710                                                              following global
9711                                                              data read is no
9712                                                              older than the
9713                                                              atomicrmw value
9714                                                              being acquired.
9715
9716                                                          4. buffer_gl0_inv
9717
9718                                                            - If CU wavefront execution
9719                                                              mode, omit.
9720                                                            - Ensures that
9721                                                              following
9722                                                              loads will not see
9723                                                              stale data.
9724
9725      atomicrmw    acq_rel      - workgroup    - local    1. s_waitcnt vmcnt(0) & vscnt(0)
9726
9727                                                            - If CU wavefront execution
9728                                                              mode, omit.
9729                                                            - If OpenCL, omit.
9730                                                            - Could be split into
9731                                                              separate s_waitcnt
9732                                                              vmcnt(0) and s_waitcnt
9733                                                              vscnt(0) to allow
9734                                                              them to be
9735                                                              independently moved
9736                                                              according to the
9737                                                              following rules.
9738                                                            - s_waitcnt vmcnt(0)
9739                                                              must happen after
9740                                                              any preceding
9741                                                              global/generic load/load
9742                                                              atomic/
9743                                                              atomicrmw-with-return-value.
9744                                                            - s_waitcnt vscnt(0)
9745                                                              must happen after
9746                                                              any preceding
9747                                                              global/generic
9748                                                              store/store atomic/
9749                                                              atomicrmw-no-return-value.
9750                                                            - Must happen before
9751                                                              the following
9752                                                              atomicrmw.
9753                                                            - Ensures that all
9754                                                              global memory
9755                                                              operations have
9756                                                              completed before
9757                                                              performing the
9758                                                              atomicrmw that is
9759                                                              being released.
9760
9761                                                          2. ds_atomic
9762                                                          3. s_waitcnt lgkmcnt(0)
9763
9764                                                            - If OpenCL, omit.
9765                                                            - Must happen before
9766                                                              the following
9767                                                              buffer_gl0_inv.
9768                                                            - Ensures any
9769                                                              following global
9770                                                              data read is no
9771                                                              older than the local
9772                                                              atomicrmw value being
9773                                                              acquired.
9774
9775                                                          4. buffer_gl0_inv
9776
9777                                                            - If CU wavefront execution
9778                                                              mode, omit.
9779                                                            - If OpenCL, omit.
9780                                                            - Ensures that
9781                                                              following
9782                                                              loads will not see
9783                                                              stale data.
9784
9785      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0) &
9786                                                             vmcnt(0) & vscnt(0)
9787
9788                                                            - If CU wavefront execution
9789                                                              mode, omit vmcnt(0) and
9790                                                              vscnt(0).
9791                                                            - If OpenCL, omit lgkmcnt(0).
9792                                                            - Could be split into
9793                                                              separate s_waitcnt
9794                                                              vmcnt(0), s_waitcnt
9795                                                              vscnt(0) and s_waitcnt
9796                                                              lgkmcnt(0) to allow
9797                                                              them to be
9798                                                              independently moved
9799                                                              according to the
9800                                                              following rules.
9801                                                            - s_waitcnt vmcnt(0)
9802                                                              must happen after
9803                                                              any preceding
9804                                                              global/generic load/load
9805                                                              atomic/
9806                                                              atomicrmw-with-return-value.
9807                                                            - s_waitcnt vscnt(0)
9808                                                              must happen after
9809                                                              any preceding
9810                                                              global/generic
9811                                                              store/store
9812                                                              atomic/
9813                                                              atomicrmw-no-return-value.
9814                                                            - s_waitcnt lgkmcnt(0)
9815                                                              must happen after
9816                                                              any preceding
9817                                                              local/generic
9818                                                              load/store/load
9819                                                              atomic/store
9820                                                              atomic/atomicrmw.
9821                                                            - Must happen before
9822                                                              the following
9823                                                              atomicrmw.
9824                                                            - Ensures that all
9825                                                              memory operations
9826                                                              have
9827                                                              completed before
9828                                                              performing the
9829                                                              atomicrmw that is
9830                                                              being released.
9831
9832                                                          2. flat_atomic
9833                                                          3. s_waitcnt lgkmcnt(0) &
9834                                                             vmcnt(0) & vscnt(0)
9835
9836                                                            - If CU wavefront execution
9837                                                              mode, omit vmcnt(0) and
9838                                                              vscnt(0).
9839                                                            - If OpenCL, omit lgkmcnt(0).
9840                                                            - Must happen before
9841                                                              the following
9842                                                              buffer_gl0_inv.
9843                                                            - Ensures any
9844                                                              following global
9845                                                              data read is no
9846                                                              older than the
9847                                                              atomicrmw value being
9848                                                              acquired.
9849
9850                                                          4. buffer_gl0_inv
9851
9852                                                            - If CU wavefront execution
9853                                                              mode, omit.
9854                                                            - Ensures that
9855                                                              following
9856                                                              loads will not see
9857                                                              stale data.
9858
9859      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
9860                                - system                     vmcnt(0) & vscnt(0)
9861
9862                                                            - If OpenCL, omit
9863                                                              lgkmcnt(0).
9864                                                            - Could be split into
9865                                                              separate s_waitcnt
9866                                                              vmcnt(0), s_waitcnt
9867                                                              vscnt(0) and s_waitcnt
9868                                                              lgkmcnt(0) to allow
9869                                                              them to be
9870                                                              independently moved
9871                                                              according to the
9872                                                              following rules.
9873                                                            - s_waitcnt vmcnt(0)
9874                                                              must happen after
9875                                                              any preceding
9876                                                              global/generic
9877                                                              load/load atomic/
9878                                                              atomicrmw-with-return-value.
9879                                                            - s_waitcnt vscnt(0)
9880                                                              must happen after
9881                                                              any preceding
9882                                                              global/generic
9883                                                              store/store atomic/
9884                                                              atomicrmw-no-return-value.
9885                                                            - s_waitcnt lgkmcnt(0)
9886                                                              must happen after
9887                                                              any preceding
9888                                                              local/generic
9889                                                              load/store/load
9890                                                              atomic/store
9891                                                              atomic/atomicrmw.
9892                                                            - Must happen before
9893                                                              the following
9894                                                              atomicrmw.
9895                                                            - Ensures that all
9896                                                              memory operations
9897                                                              to global have
9898                                                              completed before
9899                                                              performing the
9900                                                              atomicrmw that is
9901                                                              being released.
9902
9903                                                          2. buffer/global_atomic
9904                                                          3. s_waitcnt vm/vscnt(0)
9905
9906                                                            - Use vmcnt(0) if atomic with
9907                                                              return and vscnt(0) if
9908                                                              atomic with no-return.
9909                                                            - Must happen before
9910                                                              following
9911                                                              buffer_gl*_inv.
9912                                                            - Ensures the
9913                                                              atomicrmw has
9914                                                              completed before
9915                                                              invalidating the
9916                                                              caches.
9917
9918                                                          4. buffer_gl0_inv;
9919                                                             buffer_gl1_inv
9920
9921                                                            - Must happen before
9922                                                              any following
9923                                                              global/generic
9924                                                              load/load
9925                                                              atomic/atomicrmw.
9926                                                            - Ensures that
9927                                                              following loads
9928                                                              will not see stale
9929                                                              global data.
9930
9931      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
9932                                - system                     vmcnt(0) & vscnt(0)
9933
9934                                                            - If OpenCL, omit
9935                                                              lgkmcnt(0).
9936                                                            - Could be split into
9937                                                              separate s_waitcnt
9938                                                              vmcnt(0), s_waitcnt
9939                                                              vscnt(0), and s_waitcnt
9940                                                              lgkmcnt(0) to allow
9941                                                              them to be
9942                                                              independently moved
9943                                                              according to the
9944                                                              following rules.
9945                                                            - s_waitcnt vmcnt(0)
9946                                                              must happen after
9947                                                              any preceding
9948                                                              global/generic
9949                                                              load/load atomic/
9950                                                              atomicrmw-with-return-value.
9951                                                            - s_waitcnt vscnt(0)
9952                                                              must happen after
9953                                                              any preceding
9954                                                              global/generic
9955                                                              store/store atomic/
9956                                                              atomicrmw-no-return-value.
9957                                                            - s_waitcnt lgkmcnt(0)
9958                                                              must happen after
9959                                                              any preceding
9960                                                              local/generic
9961                                                              load/store/load
9962                                                              atomic/store
9963                                                              atomic/atomicrmw.
9964                                                            - Must happen before
9965                                                              the following
9966                                                              atomicrmw.
9967                                                            - Ensures that all
9968                                                              memory operations
9969                                                              have
9970                                                              completed before
9971                                                              performing the
9972                                                              atomicrmw that is
9973                                                              being released.
9974
9975                                                          2. flat_atomic
9976                                                          3. s_waitcnt vm/vscnt(0) &
9977                                                             lgkmcnt(0)
9978
9979                                                            - If OpenCL, omit
9980                                                              lgkmcnt(0).
9981                                                            - Use vmcnt(0) if atomic with
9982                                                              return and vscnt(0) if
9983                                                              atomic with no-return.
9984                                                            - Must happen before
9985                                                              following
9986                                                              buffer_gl*_inv.
9987                                                            - Ensures the
9988                                                              atomicrmw has
9989                                                              completed before
9990                                                              invalidating the
9991                                                              caches.
9992
9993                                                          4. buffer_gl0_inv;
9994                                                             buffer_gl1_inv
9995
9996                                                            - Must happen before
9997                                                              any following
9998                                                              global/generic
9999                                                              load/load
10000                                                              atomic/atomicrmw.
10001                                                            - Ensures that
10002                                                              following loads
10003                                                              will not see stale
10004                                                              global data.
10005
10006      fence        acq_rel      - singlethread *none*     *none*
10007                                - wavefront
10008      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0) &
10009                                                             vmcnt(0) & vscnt(0)
10010
10011                                                            - If CU wavefront execution
10012                                                              mode, omit vmcnt(0) and
10013                                                              vscnt(0).
10014                                                            - If OpenCL and
10015                                                              address space is
10016                                                              not generic, omit
10017                                                              lgkmcnt(0).
10018                                                            - If OpenCL and
10019                                                              address space is
10020                                                              local, omit
10021                                                              vmcnt(0) and vscnt(0).
10022                                                            - However,
10023                                                              since LLVM
10024                                                              currently has no
10025                                                              address space on
10026                                                              the fence, need to
10027                                                              conservatively
10028                                                              always generate
10029                                                              (see comment for
10030                                                              previous fence).
10031                                                            - Could be split into
10032                                                              separate s_waitcnt
10033                                                              vmcnt(0), s_waitcnt
10034                                                              vscnt(0) and s_waitcnt
10035                                                              lgkmcnt(0) to allow
10036                                                              them to be
10037                                                              independently moved
10038                                                              according to the
10039                                                              following rules.
10040                                                            - s_waitcnt vmcnt(0)
10041                                                              must happen after
10042                                                              any preceding
10043                                                              global/generic
10044                                                              load/load
10045                                                              atomic/
10046                                                              atomicrmw-with-return-value.
10047                                                            - s_waitcnt vscnt(0)
10048                                                              must happen after
10049                                                              any preceding
10050                                                              global/generic
10051                                                              store/store atomic/
10052                                                              atomicrmw-no-return-value.
10053                                                            - s_waitcnt lgkmcnt(0)
10054                                                              must happen after
10055                                                              any preceding
10056                                                              local/generic
10057                                                              load/store/load
10058                                                              atomic/store atomic/
10059                                                              atomicrmw.
10060                                                            - Must happen before
10061                                                              any following
10062                                                              global/generic
10063                                                              load/load
10064                                                              atomic/store/store
10065                                                              atomic/atomicrmw.
10066                                                            - Ensures that all
10067                                                              memory operations
10068                                                              have
10069                                                              completed before
10070                                                              performing any
10071                                                              following global
10072                                                              memory operations.
10073                                                            - Ensures that the
10074                                                              preceding
10075                                                              local/generic load
10076                                                              atomic/atomicrmw
10077                                                              with an equal or
10078                                                              wider sync scope
10079                                                              and memory ordering
10080                                                              stronger than
10081                                                              unordered (this is
10082                                                              termed the
10083                                                              acquire-fence-paired-atomic)
10084                                                              has completed
10085                                                              before following
10086                                                              global memory
10087                                                              operations. This
10088                                                              satisfies the
10089                                                              requirements of
10090                                                              acquire.
10091                                                            - Ensures that all
10092                                                              previous memory
10093                                                              operations have
10094                                                              completed before a
10095                                                              following
10096                                                              local/generic store
10097                                                              atomic/atomicrmw
10098                                                              with an equal or
10099                                                              wider sync scope
10100                                                              and memory ordering
10101                                                              stronger than
10102                                                              unordered (this is
10103                                                              termed the
10104                                                              release-fence-paired-atomic).
10105                                                              This satisfies the
10106                                                              requirements of
10107                                                              release.
10108                                                            - Must happen before
10109                                                              the following
10110                                                              buffer_gl0_inv.
10111                                                            - Ensures that the
10112                                                              acquire-fence-paired-atomic
10113                                                              has completed
10114                                                              before invalidating
10115                                                              the
10116                                                              cache. Therefore
10117                                                              any following
10118                                                              locations read must
10119                                                              be no older than
10120                                                              the value read by
10121                                                              the
10122                                                              acquire-fence-paired-atomic.
10123
10124                                                          2. buffer_gl0_inv
10125
10126                                                            - If CU wavefront execution
10127                                                              mode, omit.
10128                                                            - Ensures that
10129                                                              following
10130                                                              loads will not see
10131                                                              stale data.
10132
10133      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
10134                                - system                     vmcnt(0) & vscnt(0)
10135
10136                                                            - If OpenCL and
10137                                                              address space is
10138                                                              not generic, omit
10139                                                              lgkmcnt(0).
10140                                                            - If OpenCL and
10141                                                              address space is
10142                                                              local, omit
10143                                                              vmcnt(0) and vscnt(0).
10144                                                            - However, since LLVM
10145                                                              currently has no
10146                                                              address space on
10147                                                              the fence, need to
10148                                                              conservatively
10149                                                              always generate
10150                                                              (see comment for
10151                                                              previous fence).
10152                                                            - Could be split into
10153                                                              separate s_waitcnt
10154                                                              vmcnt(0), s_waitcnt
10155                                                              vscnt(0) and s_waitcnt
10156                                                              lgkmcnt(0) to allow
10157                                                              them to be
10158                                                              independently moved
10159                                                              according to the
10160                                                              following rules.
10161                                                            - s_waitcnt vmcnt(0)
10162                                                              must happen after
10163                                                              any preceding
10164                                                              global/generic
10165                                                              load/load
10166                                                              atomic/
10167                                                              atomicrmw-with-return-value.
10168                                                            - s_waitcnt vscnt(0)
10169                                                              must happen after
10170                                                              any preceding
10171                                                              global/generic
10172                                                              store/store atomic/
10173                                                              atomicrmw-no-return-value.
10174                                                            - s_waitcnt lgkmcnt(0)
10175                                                              must happen after
10176                                                              any preceding
10177                                                              local/generic
10178                                                              load/store/load
10179                                                              atomic/store
10180                                                              atomic/atomicrmw.
10181                                                            - Must happen before
10182                                                              the following
10183                                                              buffer_gl*_inv.
10184                                                            - Ensures that the
10185                                                              preceding
10186                                                              global/local/generic
10187                                                              load
10188                                                              atomic/atomicrmw
10189                                                              with an equal or
10190                                                              wider sync scope
10191                                                              and memory ordering
10192                                                              stronger than
10193                                                              unordered (this is
10194                                                              termed the
10195                                                              acquire-fence-paired-atomic)
10196                                                              has completed
10197                                                              before invalidating
10198                                                              the caches. This
10199                                                              satisfies the
10200                                                              requirements of
10201                                                              acquire.
10202                                                            - Ensures that all
10203                                                              previous memory
10204                                                              operations have
10205                                                              completed before a
10206                                                              following
10207                                                              global/local/generic
10208                                                              store
10209                                                              atomic/atomicrmw
10210                                                              with an equal or
10211                                                              wider sync scope
10212                                                              and memory ordering
10213                                                              stronger than
10214                                                              unordered (this is
10215                                                              termed the
10216                                                              release-fence-paired-atomic).
10217                                                              This satisfies the
10218                                                              requirements of
10219                                                              release.
10220
10221                                                          2. buffer_gl0_inv;
10222                                                             buffer_gl1_inv
10223
10224                                                            - Must happen before
10225                                                              any following
10226                                                              global/generic
10227                                                              load/load
10228                                                              atomic/store/store
10229                                                              atomic/atomicrmw.
10230                                                            - Ensures that
10231                                                              following loads
10232                                                              will not see stale
10233                                                              global data. This
10234                                                              satisfies the
10235                                                              requirements of
10236                                                              acquire.
10237
10238      **Sequential Consistent Atomic**
10239      ------------------------------------------------------------------------------------
10240      load atomic  seq_cst      - singlethread - global   *Same as corresponding
10241                                - wavefront    - local    load atomic acquire,
10242                                               - generic  except must generate
10243                                                          all instructions even
10244                                                          for OpenCL.*
10245      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0) &
10246                                               - generic     vmcnt(0) & vscnt(0)
10247
10248                                                            - If CU wavefront execution
10249                                                              mode, omit vmcnt(0) and
10250                                                              vscnt(0).
10251                                                            - Could be split into
10252                                                              separate s_waitcnt
10253                                                              vmcnt(0), s_waitcnt
10254                                                              vscnt(0), and s_waitcnt
10255                                                              lgkmcnt(0) to allow
10256                                                              them to be
10257                                                              independently moved
10258                                                              according to the
10259                                                              following rules.
10260                                                            - s_waitcnt lgkmcnt(0) must
10261                                                              happen after
10262                                                              preceding
10263                                                              local/generic load
10264                                                              atomic/store
10265                                                              atomic/atomicrmw
10266                                                              with memory
10267                                                              ordering of seq_cst
10268                                                              and with equal or
10269                                                              wider sync scope.
10270                                                              (Note that seq_cst
10271                                                              fences have their
10272                                                              own s_waitcnt
10273                                                              lgkmcnt(0) and so do
10274                                                              not need to be
10275                                                              considered.)
10276                                                            - s_waitcnt vmcnt(0)
10277                                                              must happen after
10278                                                              preceding
10279                                                              global/generic load
10280                                                              atomic/
10281                                                              atomicrmw-with-return-value
10282                                                              with memory
10283                                                              ordering of seq_cst
10284                                                              and with equal or
10285                                                              wider sync scope.
10286                                                              (Note that seq_cst
10287                                                              fences have their
10288                                                              own s_waitcnt
10289                                                              vmcnt(0) and so do
10290                                                              not need to be
10291                                                              considered.)
10292                                                            - s_waitcnt vscnt(0)
10293                                                              must happen after
10294                                                              preceding
10295                                                              global/generic store
10296                                                              atomic/
10297                                                              atomicrmw-no-return-value
10298                                                              with memory
10299                                                              ordering of seq_cst
10300                                                              and with equal or
10301                                                              wider sync scope.
10302                                                              (Note that seq_cst
10303                                                              fences have their
10304                                                              own s_waitcnt
10305                                                              vscnt(0) and so do
10306                                                              not need to be
10307                                                              considered.)
10308                                                            - Ensures any
10309                                                              preceding
10310                                                              sequentially
10311                                                              consistent global/local
10312                                                              memory instructions
10313                                                              have completed
10314                                                              before executing
10315                                                              this sequentially
10316                                                              consistent
10317                                                              instruction. This
10318                                                              prevents reordering
10319                                                              a seq_cst store
10320                                                              followed by a
10321                                                              seq_cst load. (Note
10322                                                              that seq_cst is
10323                                                              stronger than
10324                                                              acquire/release as
10325                                                              the reordering of
10326                                                              load acquire
10327                                                              followed by a store
10328                                                              release is
10329                                                              prevented by the
10330                                                              s_waitcnt of
10331                                                              the release, but
10332                                                              there is nothing
10333                                                              preventing a store
10334                                                              release followed by
10335                                                              load acquire from
10336                                                              completing out of
10337                                                              order. The s_waitcnt
10338                                                              could be placed after
10339                                                              the seq_cst store or
10340                                                              before the seq_cst load. We
10341                                                              choose the load to
10342                                                              make the s_waitcnt be
10343                                                              as late as possible
10344                                                              so that the store
10345                                                              may have already
10346                                                              completed.)
10347
10348                                                          2. *Following
10349                                                             instructions same as
10350                                                             corresponding load
10351                                                             atomic acquire,
10352                                                             except must generate
10353                                                             all instructions even
10354                                                             for OpenCL.*
10355      load atomic  seq_cst      - workgroup    - local
10356
10357                                                          1. s_waitcnt vmcnt(0) & vscnt(0)
10358
10359                                                            - If CU wavefront execution
10360                                                              mode, omit.
10361                                                            - Could be split into
10362                                                              separate s_waitcnt
10363                                                              vmcnt(0) and s_waitcnt
10364                                                              vscnt(0) to allow
10365                                                              them to be
10366                                                              independently moved
10367                                                              according to the
10368                                                              following rules.
10369                                                            - s_waitcnt vmcnt(0)
10370                                                              must happen after
10371                                                              preceding
10372                                                              global/generic load
10373                                                              atomic/
10374                                                              atomicrmw-with-return-value
10375                                                              with memory
10376                                                              ordering of seq_cst
10377                                                              and with equal or
10378                                                              wider sync scope.
10379                                                              (Note that seq_cst
10380                                                              fences have their
10381                                                              own s_waitcnt
10382                                                              vmcnt(0) and so do
10383                                                              not need to be
10384                                                              considered.)
10385                                                            - s_waitcnt vscnt(0)
10386                                                              must happen after
10387                                                              preceding
10388                                                              global/generic store
10389                                                              atomic/
10390                                                              atomicrmw-no-return-value
10391                                                              with memory
10392                                                              ordering of seq_cst
10393                                                              and with equal or
10394                                                              wider sync scope.
10395                                                              (Note that seq_cst
10396                                                              fences have their
10397                                                              own s_waitcnt
10398                                                              vscnt(0) and so do
10399                                                              not need to be
10400                                                              considered.)
10401                                                            - Ensures any
10402                                                              preceding
10403                                                              sequentially
10404                                                              consistent global
10405                                                              memory instructions
10406                                                              have completed
10407                                                              before executing
10408                                                              this sequentially
10409                                                              consistent
10410                                                              instruction. This
10411                                                              prevents reordering
10412                                                              a seq_cst store
10413                                                              followed by a
10414                                                              seq_cst load. (Note
10415                                                              that seq_cst is
10416                                                              stronger than
10417                                                              acquire/release as
10418                                                              the reordering of
10419                                                              load acquire
10420                                                              followed by a store
10421                                                              release is
10422                                                              prevented by the
10423                                                              s_waitcnt of
10424                                                              the release, but
10425                                                              there is nothing
10426                                                              preventing a store
10427                                                              release followed by
10428                                                              load acquire from
10429                                                              completing out of
10430                                                              order. The s_waitcnt
10431                                                              could be placed after
10432                                                              the seq_cst store or
10433                                                              before the seq_cst load. We
10434                                                              choose the load to
10435                                                              make the s_waitcnt be
10436                                                              as late as possible
10437                                                              so that the store
10438                                                              may have already
10439                                                              completed.)
10440
10441                                                          2. *Following
10442                                                             instructions same as
10443                                                             corresponding load
10444                                                             atomic acquire,
10445                                                             except must generate
10446                                                             all instructions even
10447                                                             for OpenCL.*
10448
10449      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
10450                                - system       - generic     vmcnt(0) & vscnt(0)
10451
10452                                                            - Could be split into
10453                                                              separate s_waitcnt
10454                                                              vmcnt(0), s_waitcnt
10455                                                              vscnt(0) and s_waitcnt
10456                                                              lgkmcnt(0) to allow
10457                                                              them to be
10458                                                              independently moved
10459                                                              according to the
10460                                                              following rules.
10461                                                            - s_waitcnt lgkmcnt(0)
10462                                                              must happen after
10463                                                              preceding
10464                                                              local load
10465                                                              atomic/store
10466                                                              atomic/atomicrmw
10467                                                              with memory
10468                                                              ordering of seq_cst
10469                                                              and with equal or
10470                                                              wider sync scope.
10471                                                              (Note that seq_cst
10472                                                              fences have their
10473                                                              own s_waitcnt
10474                                                              lgkmcnt(0) and so do
10475                                                              not need to be
10476                                                              considered.)
10477                                                            - s_waitcnt vmcnt(0)
10478                                                              must happen after
10479                                                              preceding
10480                                                              global/generic load
10481                                                              atomic/
10482                                                              atomicrmw-with-return-value
10483                                                              with memory
10484                                                              ordering of seq_cst
10485                                                              and with equal or
10486                                                              wider sync scope.
10487                                                              (Note that seq_cst
10488                                                              fences have their
10489                                                              own s_waitcnt
10490                                                              vmcnt(0) and so do
10491                                                              not need to be
10492                                                              considered.)
10493                                                            - s_waitcnt vscnt(0)
10494                                                              must happen after
10495                                                              preceding
10496                                                              global/generic store
10497                                                              atomic/
10498                                                              atomicrmw-no-return-value
10499                                                              with memory
10500                                                              ordering of seq_cst
10501                                                              and with equal or
10502                                                              wider sync scope.
10503                                                              (Note that seq_cst
10504                                                              fences have their
10505                                                              own s_waitcnt
10506                                                              vscnt(0) and so do
10507                                                              not need to be
10508                                                              considered.)
10509                                                            - Ensures any
10510                                                              preceding
10511                                                              sequentially
10512                                                              consistent global
10513                                                              memory instructions
10514                                                              have completed
10515                                                              before executing
10516                                                              this sequentially
10517                                                              consistent
10518                                                              instruction. This
10519                                                              prevents reordering
10520                                                              a seq_cst store
10521                                                              followed by a
10522                                                              seq_cst load. (Note
10523                                                              that seq_cst is
10524                                                              stronger than
10525                                                              acquire/release as
10526                                                              the reordering of
10527                                                              load acquire
10528                                                              followed by a store
10529                                                              release is
10530                                                              prevented by the
10531                                                              s_waitcnt of
10532                                                              the release, but
10533                                                              there is nothing
10534                                                              preventing a store
10535                                                              release followed by
10536                                                              load acquire from
10537                                                              completing out of
10538                                                              order. The s_waitcnt
10539                                                              could be placed after
10540                                                              the seq_cst store or
10541                                                              before the seq_cst load. We
10542                                                              choose the load to
10543                                                              make the s_waitcnt be
10544                                                              as late as possible
10545                                                              so that the store
10546                                                              may have already
10547                                                              completed.)
10548
10549                                                          2. *Following
10550                                                             instructions same as
10551                                                             corresponding load
10552                                                             atomic acquire,
10553                                                             except must generate
10554                                                             all instructions even
10555                                                             for OpenCL.*
10556      store atomic seq_cst      - singlethread - global   *Same as corresponding
10557                                - wavefront    - local    store atomic release,
10558                                - workgroup    - generic  except must generate
10559                                - agent                   all instructions even
10560                                - system                  for OpenCL.*
10561      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
10562                                - wavefront    - local    atomicrmw acq_rel,
10563                                - workgroup    - generic  except must generate
10564                                - agent                   all instructions even
10565                                - system                  for OpenCL.*
10566      fence        seq_cst      - singlethread *none*     *Same as corresponding
10567                                - wavefront               fence acq_rel,
10568                                - workgroup               except must generate
10569                                - agent                   all instructions even
10570                                - system                  for OpenCL.*
10571      ============ ============ ============== ========== ================================
10572
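The parenthetical note in the ``seq_cst`` load rows above corresponds to the
classic "store buffering" litmus test. The following is a minimal host-side
C++ sketch of that test, not AMDGPU code. ``store_buffering_round`` is a
hypothetical helper introduced only for this illustration, and the sketch
assumes that the C++ ``std::memory_order`` values correspond to the LLVM atomic
orderings of the same names. With ``seq_cst`` on every access the outcome
``(0, 0)`` is forbidden, which is the property that the ``s_waitcnt`` before a
``seq_cst`` load preserves.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical helper for illustration only (not an LLVM or HSA API).
// Runs one round of the store-buffering litmus test: each thread stores to
// one variable and then loads the other. Returns the two loaded values.
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
std::pair<int, int> store_buffering_round() {
  std::atomic<int> x{0};
  std::atomic<int> y{0};
  int r1 = -1;
  int r2 = -1;

  std::thread t1([&] {
    x.store(1, StoreOrder);  // store ...
    r1 = y.load(LoadOrder);  // ... followed by a load of the other variable
  });
  std::thread t2([&] {
    y.store(1, StoreOrder);
    r2 = x.load(LoadOrder);
  });
  t1.join();
  t2.join();
  return {r1, r2};
}
```

Instantiating the helper with ``std::memory_order_release`` stores and
``std::memory_order_acquire`` loads permits the ``(0, 0)`` outcome, although
any particular run may not exhibit it.
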
10573 Trap Handler ABI
10574 ~~~~~~~~~~~~~~~~
10575
10576 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
10577 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
10578 supports the ``s_trap`` instruction. For usage see:
10579
10580 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
10581 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
10582 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`
10583
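As a rough illustration of how source code reaches the ``llvm.trap`` and
``llvm.debugtrap`` usages listed in these tables, the sketch below uses the
Clang builtins ``__builtin_trap()`` and ``__builtin_debugtrap()``, which Clang
lowers to those two intrinsics. It is written as plain C++ so that it also
compiles for a host. The helper names ``checked_load`` and ``debug_checkpoint``
are hypothetical, and the guard around ``__builtin_debugtrap`` is there only
because not every compiler provides that builtin.

```cpp
#include <cstddef>

// Hypothetical helper: returns data[i], trapping on an out-of-range index.
// With Clang, __builtin_trap() is lowered to the llvm.trap intrinsic.
inline int checked_load(const int *data, std::size_t size, std::size_t i) {
  if (i >= size)
    __builtin_trap();
  return data[i];
}

// Hypothetical helper: requests a debugger stop where the builtin exists.
// With Clang, __builtin_debugtrap() is lowered to llvm.debugtrap.
// On compilers without the builtin this function does nothing.
inline void debug_checkpoint() {
#if defined(__has_builtin)
#if __has_builtin(__builtin_debugtrap)
  __builtin_debugtrap();
#endif
#endif
}
```

When such code is compiled as device code for an ``amdhsa`` target, the two
intrinsics would be expected to select the ``s_trap`` code sequences shown in
the tables.
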
10584   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
10585      :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
10586
10587      =================== =============== =============== =======================================
10588      Usage               Code Sequence   Trap Handler    Description
10589                                          Inputs
10590      =================== =============== =============== =======================================
10591      reserved            ``s_trap 0x00``                 Reserved by hardware.
10592      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
10593                                            ``queue_ptr`` intrinsic (not implemented).
10594                                          ``VGPR0``:
10595                                            ``arg``
10596      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
10597                                            ``queue_ptr`` the trap instruction. The associated
10598                                                          queue is signalled to put it into the
10599                                                          error state.  When the queue is put in
10600                                                          the error state, the waves executing
10601                                                          dispatches on the queue will be
10602                                                          terminated.
10603      ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
10604                                                            as a no-operation. The trap handler
10605                                                            is entered and immediately returns to
10606                                                            continue execution of the wavefront.
10607                                                          - If the debugger is enabled, causes
10608                                                            the debug trap to be reported by the
10609                                                            debugger and the wavefront is put in
10610                                                            the halt state with the PC at the
10611                                                            instruction.  The debugger must
10612                                                            increment the PC and resume the wave.
10613      reserved            ``s_trap 0x04``                 Reserved.
10614      reserved            ``s_trap 0x05``                 Reserved.
10615      reserved            ``s_trap 0x06``                 Reserved.
10616      reserved            ``s_trap 0x07``                 Reserved.
10617      reserved            ``s_trap 0x08``                 Reserved.
10618      reserved            ``s_trap 0xfe``                 Reserved.
10619      reserved            ``s_trap 0xff``                 Reserved.
10620      =================== =============== =============== =======================================
10621
10622 ..
10623
10624   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
10625      :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
10626
10627      =================== =============== =============== =======================================
10628      Usage               Code Sequence   Trap Handler    Description
10629                                          Inputs
10630      =================== =============== =============== =======================================
10631      reserved            ``s_trap 0x00``                 Reserved by hardware.
10632      debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
10633                                                          breakpoints. Causes wave to be halted
10634                                                          with the PC at the trap instruction.
10635                                                          The debugger is responsible for resuming
10636                                                          the wave, including the instruction
10637                                                          that the breakpoint overwrote.
10638      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
10639                                            ``queue_ptr`` the trap instruction. The associated
10640                                                          queue is signalled to put it into the
10641                                                          error state.  When the queue is put in
10642                                                          the error state, the waves executing
10643                                                          dispatches on the queue will be
10644                                                          terminated.
10645      ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled, behaves
10646                                                            as a no-operation. The trap handler
10647                                                            is entered and immediately returns to
10648                                                            continue execution of the wavefront.
10649                                                          - If the debugger is enabled, causes
10650                                                            the debug trap to be reported by the
10651                                                            debugger and the wavefront is put in
10652                                                            the halt state with the PC at the
10653                                                            instruction.  The debugger must
10654                                                            increment the PC and resume the wave.
10655      reserved            ``s_trap 0x04``                 Reserved.
10656      reserved            ``s_trap 0x05``                 Reserved.
10657      reserved            ``s_trap 0x06``                 Reserved.
10658      reserved            ``s_trap 0x07``                 Reserved.
10659      reserved            ``s_trap 0x08``                 Reserved.
10660      reserved            ``s_trap 0xfe``                 Reserved.
10661      reserved            ``s_trap 0xff``                 Reserved.
10662      =================== =============== =============== =======================================
10663
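The trap ID assignments in the Code Object V3 table above can be summarized as
a small lookup. This is only a sketch for readers writing tools that decode
``s_trap`` operands. The function name ``describe_trap_id_v3`` is hypothetical
and is not part of LLVM or of any runtime. IDs that the table does not list are
reported as such rather than guessed.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical helper: maps an s_trap ID to the "Usage" column of the
// Code Object V3 trap handler table.
inline std::string_view describe_trap_id_v3(std::uint8_t id) {
  switch (id) {
  case 0x00:
    return "reserved by hardware";
  case 0x01:
    return "debugger breakpoint";
  case 0x02:
    return "llvm.trap"; // trap handler input: SGPR0-1 hold queue_ptr
  case 0x03:
    return "llvm.debugtrap";
  case 0x04:
  case 0x05:
  case 0x06:
  case 0x07:
  case 0x08:
  case 0xfe:
  case 0xff:
    return "reserved";
  default:
    return "not listed"; // the table gives no entry for this ID
  }
}
```
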
10664 ..
10665
10666   .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
10667      :name: amdgpu-trap-handler-for-amdhsa-os-v4-table
10668
10669      =================== =============== ================ ================= =======================================
10670      Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
10671      =================== =============== ================ ================= =======================================
10672      reserved            ``s_trap 0x00``                                    Reserved by hardware.
10673      debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
10674                                                                             breakpoints. Causes wave to be halted
10675                                                                             with the PC at the trap instruction.
10676                                                                             The debugger is responsible for resuming
10677                                                                             the wave, including the instruction
10678                                                                             that the breakpoint overwrote.
10679      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
10680                                            ``queue_ptr``                    the trap instruction. The associated
10681                                                                             queue is signalled to put it into the
10682                                                                             error state.  When the queue is put in
10683                                                                             the error state, the waves executing
10684                                                                             dispatches on the queue will be
10685                                                                             terminated.
10686      ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If debugger not enabled then behaves
10687                                                                               as a no-operation. The trap handler
10688                                                                               is entered and immediately returns to
10689                                                                               continue execution of the wavefront.
10690                                                                             - If the debugger is enabled, causes
10691                                                                               the debug trap to be reported by the
10692                                                                               debugger and the wavefront is put in
10693                                                                               the halt state with the PC at the
10694                                                                               instruction.  The debugger must
10695                                                                               increment the PC and resume the wave.
10696      reserved            ``s_trap 0x04``                                    Reserved.
10697      reserved            ``s_trap 0x05``                                    Reserved.
10698      reserved            ``s_trap 0x06``                                    Reserved.
10699      reserved            ``s_trap 0x07``                                    Reserved.
10700      reserved            ``s_trap 0x08``                                    Reserved.
10701      reserved            ``s_trap 0xfe``                                    Reserved.
10702      reserved            ``s_trap 0xff``                                    Reserved.
10703      =================== =============== ================ ================= =======================================
10704
10705 .. _amdgpu-amdhsa-function-call-convention:
10706
10707 Call Convention
10708 ~~~~~~~~~~~~~~~
10709
10710 .. note::
10711
  This section is currently incomplete and contains inaccuracies. It is a work
  in progress that will be updated as information is determined.
10714
10715 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
10716 addresses. Unswizzled addresses are normal linear addresses.
10717
10718 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
10719
10720 Kernel Functions
10721 ++++++++++++++++
10722
10723 This section describes the call convention ABI for the outer kernel function.
10724
10725 See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
10726 convention.
10727
10728 The following is not part of the AMDGPU kernel calling convention but describes
10729 how the AMDGPU implements function calls:
10730
10731 1.  Clang decides the kernarg layout to match the *HSA Programmer's Language
10732     Reference* [HSA]_.
10733
10734     - All structs are passed directly.
10735     - Lambda values are passed *TBA*.
10736
10737     .. TODO::
10738
10739       - Does this really follow HSA rules? Or are structs >16 bytes passed
10740         by-value struct?
10741       - What is ABI for lambda values?
10742
2.  The kernel performs certain setup in its prolog, as described in
    :ref:`amdgpu-amdhsa-kernel-prolog`.
10745
10746 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
10747
10748 Non-Kernel Functions
10749 ++++++++++++++++++++
10750
10751 This section describes the call convention ABI for functions other than the
10752 outer kernel function.
10753
10754 If a kernel has function calls then scratch is always allocated and used for
10755 the call stack which grows from low address to high address using the swizzled
10756 scratch address space.
10757
10758 On entry to a function:
10759
10760 1.  SGPR0-3 contain a V# with the following properties (see
10761     :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
10762
10763     * Base address pointing to the beginning of the wavefront scratch backing
10764       memory.
10765     * Swizzled with dword element size and stride of wavefront size elements.
10766
2.  The FLAT_SCRATCH register pair is set up. See
10768     :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
10769 3.  GFX6-GFX8: M0 register set to the size of LDS in bytes. See
10770     :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
10771 4.  The EXEC register is set to the lanes active on entry to the function.
10772 5.  MODE register: *TBD*
10773 6.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
10774     below.
7.  SGPR30-31 contain the return address (RA). This is the code address that
    the function must return to when it completes. The value is undefined if
    the function is *no return*.
10778 8.  SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
10779     offset relative to the beginning of the wavefront scratch backing memory.
10780
10781     The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
10782     offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
10783     manner.
10784
10785     The unswizzled SP value can be converted into the swizzled SP value by:
10786
10787       | swizzled SP = unswizzled SP / wavefront size
10788
10789     This may be used to obtain the private address space address of stack
10790     objects and to convert this address to a flat address by adding the flat
10791     scratch aperture base address.
10792
    The swizzled SP value is always 4 byte aligned for the ``r600``
    architecture and 16 byte aligned for the ``amdgcn`` architecture.
10795
10796     .. note::
10797
10798       The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
10799       OpenCL language which has the largest base type defined as 16 bytes.
10800
10801     On entry, the swizzled SP value is the address of the first function
10802     argument passed on the stack. Other stack passed arguments are positive
10803     offsets from the entry swizzled SP value.
10804
10805     The function may use positive offsets beyond the last stack passed argument
10806     for stack allocated local variables and register spill slots. If necessary,
10807     the function may align these to greater alignment than 16 bytes. After these
10808     the function may dynamically allocate space for such things as runtime sized
10809     ``alloca`` local allocations.
10810
10811     If the function calls another function, it will place any stack allocated
10812     arguments after the last local allocation and adjust SGPR32 to the address
10813     after the last local allocation.
10814
10815 9.  All other registers are unspecified.
10816 10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
10817     to the function.
10818
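The SP arithmetic described in item 8 of the entry state above can be sketched
as follows. This is a minimal illustration in Python, not backend code: the
function names, the wavefront size of 64, and the aperture base value are
assumptions chosen for the example.

```python
def swizzled_sp(unswizzled_sp: int, wavefront_size: int) -> int:
    """Convert the unswizzled SP (held in SGPR32) into the swizzled SP value."""
    return unswizzled_sp // wavefront_size


def flat_address(private_address: int, flat_scratch_aperture_base: int) -> int:
    """Convert a private (swizzled) address into a flat address."""
    return flat_scratch_aperture_base + private_address


# Hypothetical values, for illustration only.
WAVEFRONT_SIZE = 64
APERTURE_BASE = 0x10000000

private_sp = swizzled_sp(0x4000, WAVEFRONT_SIZE)
print(hex(private_sp))
print(hex(flat_address(private_sp, APERTURE_BASE)))
# For amdgcn the swizzled SP is expected to be 16 byte aligned.
print(private_sp % 16 == 0)
```

The division is exact whenever the unswizzled SP is a multiple of the wavefront
size, which is the case for a stack that is swizzled with a stride of wavefront
size elements.
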
10819 On exit from a function:
10820
10821 1.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
10822     described below. Any registers used are considered clobbered registers.
10823 2.  The following registers are preserved and have the same value as on entry:
10824
10825     * FLAT_SCRATCH
10826     * EXEC
10827     * GFX6-GFX8: M0
10828     * All SGPR registers except the clobbered registers of SGPR4-31.
10829     * VGPR40-47
10830     * VGPR56-63
10831     * VGPR72-79
10832     * VGPR88-95
10833     * VGPR104-111
10834     * VGPR120-127
10835     * VGPR136-143
10836     * VGPR152-159
10837     * VGPR168-175
10838     * VGPR184-191
10839     * VGPR200-207
10840     * VGPR216-223
10841     * VGPR232-239
10842     * VGPR248-255
10843
10844         .. note::
10845
          Apart from the argument registers, the clobbered and preserved VGPRs
          are intermixed at regular intervals so that their ratio stays similar
          regardless of the number of allocated VGPRs.
10849
10850     * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
10851     * Lanes of all VGPRs that are inactive at the call site.
10852
      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
      optimization may mark some of the clobbered SGPR and VGPR registers as
      preserved if it can be determined that the called function does not change
      their value.
10857
3.  The PC is set to the RA provided on entry.
4.  MODE register: *TBD*.
5.  All other registers are clobbered.
6.  Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
    the function is available to the caller.
10863
10864 .. TODO::
10865
10866   - How are function results returned? The address of structured types is passed
10867     by reference, but what about other types?
10868
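The preserved VGPR ranges listed in the exit state above follow a regular
pattern: starting at VGPR40, every other block of 8 registers is preserved. A
minimal sketch of that pattern in Python follows. The function name is
hypothetical, and the sketch covers only the fixed ranges in the list, not any
additional registers that IPRA may treat as preserved.

```python
def is_preserved_vgpr(n: int) -> bool:
    """Return True if VGPRn falls in one of the listed preserved ranges
    (VGPR40-47, VGPR56-63, ..., VGPR248-255)."""
    if not 0 <= n <= 255:
        raise ValueError("VGPR index must be in the range 0-255")
    # Blocks of 8 preserved registers repeat every 16 registers from VGPR40.
    return n >= 40 and (n - 40) % 16 < 8


preserved = [n for n in range(256) if is_preserved_vgpr(n)]
print(len(preserved))  # 14 ranges of 8 registers each: 112
```
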
10869 The function input arguments are made up of the formal arguments explicitly
10870 declared by the source language function plus the implicit input arguments used
10871 by the implementation.
10872
10873 The source language input arguments are:
10874
10875 1. Any source language implicit ``this`` or ``self`` argument comes first as a
10876    pointer type.
10877 2. Followed by the function formal arguments in left to right source order.
10878
10879 The source language result arguments are:
10880
10881 1. The function result argument.
10882
Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if it were a separate argument. For input arguments, if
the called function requires the struct to be in memory, for example because
its address is taken, then the function body is responsible for allocating a
stack location and copying the field arguments into it. Clang terms this
*direct struct*.

Source language input struct type arguments that are greater than 16 bytes are
passed by reference. The caller is responsible for allocating a stack location,
copying the struct value into it, and passing its address as the input
argument. The called function is responsible for performing the dereference
when accessing the input argument. Clang terms this *by-value struct*.

A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes its address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
10903
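The size rule in the three paragraphs above can be summarized as a small
decision function. This is only a sketch of the rule as this section currently
states it, and the section itself flags the ``sret`` definition as unconfirmed.
The names ``Passing`` and ``classify_struct`` are hypothetical and do not
correspond to Clang or LLVM APIs.

```python
from enum import Enum


class Passing(Enum):
    DIRECT = "direct struct: decomposed into base type fields"
    BY_REFERENCE = "by-value struct: caller copies, passes address"
    SRET = "structured return: caller allocates, passes address"


# Threshold taken from the text above.
DIRECT_LIMIT_BYTES = 16


def classify_struct(size_in_bytes: int, is_result: bool) -> Passing:
    """Classify how a struct argument or result is passed, per the text."""
    if size_in_bytes <= DIRECT_LIMIT_BYTES:
        return Passing.DIRECT
    return Passing.SRET if is_result else Passing.BY_REFERENCE


print(classify_struct(16, is_result=False).name)
print(classify_struct(24, is_result=False).name)
print(classify_struct(24, is_result=True).name)
```
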
*TODO: correct the* ``sret`` *definition.*
10905
10906 .. TODO::
10907
10908   Is this definition correct? Or is ``sret`` only used if passing in registers, and
10909   pass as non-decomposed struct as stack argument? Or something else? Is the
10910   memory location in the caller stack frame, or a stack memory argument and so
10911   no address is passed as the caller can directly write to the argument stack
10912   location? But then the stack location is still live after return. If an
10913   argument stack location is it the first stack argument or the last one?
10914
10915 Lambda argument types are treated as struct types with an implementation defined
10916 set of fields.
10917
10918 .. TODO::
10919
10920   Need to specify the ABI for lambda types for AMDGPU.
10921
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
10925
10926 The AMDGPU backend walks the function call graph from the leaves to determine
10927 which implicit input arguments are used, propagating to each caller of the
10928 function. The used implicit arguments are appended to the function arguments
10929 after the source language arguments in the following order:
10930
10931 .. TODO::
10932
10933   Is recursion or external functions supported?
10934
10935 1.  Work-Item ID (1 VGPR)
10936
    The X, Y and Z work-item IDs are packed into a single VGPR with the
    following layout. Only fields actually used by the function are set. The
    other bits are undefined.
10940
10941     The values come from the initial kernel execution state. See
10942     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
10943
10944     .. table:: Work-item implicit argument layout
10945       :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
10946
10947       ======= ======= ==============
10948       Bits    Size    Field Name
10949       ======= ======= ==============
10950       9:0     10 bits X Work-Item ID
10951       19:10   10 bits Y Work-Item ID
10952       29:20   10 bits Z Work-Item ID
10953       31:30   2 bits  Unused
10954       ======= ======= ==============
10955
10956 2.  Dispatch Ptr (2 SGPRs)
10957
10958     The value comes from the initial kernel execution state. See
10959     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10960
10961 3.  Queue Ptr (2 SGPRs)
10962
10963     The value comes from the initial kernel execution state. See
10964     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10965
10966 4.  Kernarg Segment Ptr (2 SGPRs)
10967
10968     The value comes from the initial kernel execution state. See
10969     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10970
5.  Dispatch ID (2 SGPRs)
10972
10973     The value comes from the initial kernel execution state. See
10974     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10975
10976 6.  Work-Group ID X (1 SGPR)
10977
10978     The value comes from the initial kernel execution state. See
10979     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10980
10981 7.  Work-Group ID Y (1 SGPR)
10982
10983     The value comes from the initial kernel execution state. See
10984     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10985
10986 8.  Work-Group ID Z (1 SGPR)
10987
10988     The value comes from the initial kernel execution state. See
10989     :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10990
10991 9.  Implicit Argument Ptr (2 SGPRs)
10992
10993     The value is computed by adding an offset to Kernarg Segment Ptr to get the
10994     global address space pointer to the first kernarg implicit argument.
10995
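The packed work-item ID layout given for item 1 above can be read and written
with shifts and masks. The following is a minimal sketch in Python for
illustration only. The function names are hypothetical and are not part of
LLVM or any runtime.

```python
from typing import Tuple

FIELD_MASK = 0x3FF  # each work-item ID field is 10 bits wide


def pack_work_item_id(x: int, y: int, z: int) -> int:
    """Pack X, Y and Z work-item IDs into bits 9:0, 19:10 and 29:20."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << 10) | (z << 20)


def unpack_work_item_id(packed: int) -> Tuple[int, int, int]:
    """Extract (X, Y, Z) from the packed value, ignoring bits 31:30."""
    x = packed & FIELD_MASK
    y = (packed >> 10) & FIELD_MASK
    z = (packed >> 20) & FIELD_MASK
    return x, y, z


packed = pack_work_item_id(5, 3, 1)
print(unpack_work_item_id(packed))  # (5, 3, 1)
```

Because fields that the function does not use are undefined, as are bits
31:30, a consumer should read only the fields it is known to need.
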
10996 The input and result arguments are assigned in order in the following manner:
10997
10998 .. note::
10999
11000   There are likely some errors and omissions in the following description that
11001   need correction.
11002
11003   .. TODO::
11004
11005     Check the Clang source code to decipher how function arguments and return
11006     results are handled. Also see the AMDGPU specific values used.
11007
11008 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
11009   VGPR31.
11010
11011   If there are more arguments than will fit in these registers, the remaining
11012   arguments are allocated on the stack in order on naturally aligned
11013   addresses.
11014
11015   .. TODO::
11016
11017     How are overly aligned structures allocated on the stack?
11018
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR4 up to
  SGPR29.
11021
11022   If there are more arguments than will fit in these registers, the remaining
11023   arguments are allocated on the stack in order on naturally aligned
11024   addresses.
11025
11026 Note that decomposed struct type arguments may have some fields passed in
11027 registers and some in memory.
11028
11029 .. TODO::
11030
11031   So, a struct which can pass some fields as decomposed register arguments, will
11032   pass the rest as decomposed stack elements? But an argument that will not start
11033   in registers will not be decomposed and will be passed as a non-decomposed
11034   stack value?
11035
11036 The following is not part of the AMDGPU function calling convention but
11037 describes how the AMDGPU implements function calls:
11038
11039 1.  SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
11040     unswizzled scratch address. It is only needed if runtime sized ``alloca``
11041     are used, or for the reasons defined in ``SIFrameLowering``.
2.  Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
    to access the incoming stack arguments in the function. The BP is needed
    only when the function requires runtime stack alignment.

3.  Allocating SGPR arguments on the stack is not supported.
11047
11048 4.  No CFI is currently generated. See
11049     :ref:`amdgpu-dwarf-call-frame-information`.
11050
11051     .. note::
11052
11053       CFI will be generated that defines the CFA as the unswizzled address
11054       relative to the wave scratch base in the unswizzled private address space
11055       of the lowest address stack allocated local variable.
11056
11057       ``DW_AT_frame_base`` will be defined as the swizzled address in the
11058       swizzled private address space by dividing the CFA by the wavefront size
11059       (since CFA is always at least dword aligned which matches the scratch
11060       swizzle element size).
11061
11062       If no dynamic stack alignment was performed, the stack allocated arguments
11063       are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
11064       local variables and register spill slots are accessed as positive offsets
11065       relative to ``DW_AT_frame_base``.
11066
11067 5.  Function argument passing is implemented by copying the input physical
11068     registers to virtual registers on entry. The register allocator can spill if
11069     necessary. These are copied back to physical registers at call sites. The
11070     net effect is that each function call can have these values in entirely
11071     distinct locations. The IPRA can help avoid shuffling argument registers.
11072 6.  Call sites are implemented by setting up the arguments at positive offsets
11073     from SP. Then SP is incremented to account for the known frame size before
11074     the call and decremented after the call.
11075
11076     .. note::
11077
11078       The CFI will reflect the changed calculation needed to compute the CFA
11079       from SP.
11080
11081 7.  4 byte spill slots are used in the stack frame. One slot is allocated for an
11082     emergency spill slot. Buffer instructions are used for stack accesses and
11083     not the ``flat_scratch`` instruction.
11084
11085     .. TODO::
11086
11087       Explain when the emergency spill slot is used.
11088
11089 .. TODO::
11090
11091   Possible broken issues:
11092
11093   - Stack arguments must be aligned to required alignment.
11094   - Stack is aligned to max(16, max formal argument alignment)
11095   - Direct argument < 64 bits should check register budget.
11096   - Register budget calculation should respect ``inreg`` for SGPR.
11097   - SGPR overflow is not handled.
11098   - struct with 1 member unpeeling is not checking size of member.
11099   - ``sret`` is after ``this`` pointer.
11100   - Caller is not implementing stack realignment: need an extra pointer.
11101   - Should say AMDGPU passes FP rather than SP.
11102   - Should CFI define CFA as address of locals or arguments. Difference is
11103     apparent when have implemented dynamic alignment.
11104   - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
11105     highest address of stack frame and use negative offset for locals. Would
11106     allow SP to be the same as FP and could support signal-handler-like as now
11107     have a real SP for the top of the stack.
11108   - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
11109     arguments?
11110
11111 AMDPAL
11112 ------
11113
11114 This section provides code conventions used when the target triple OS is
11115 ``amdpal`` (see :ref:`amdgpu-target-triples`).
11116
11117 .. _amdgpu-amdpal-code-object-metadata-section:
11118
11119 Code Object Metadata
11120 ~~~~~~~~~~~~~~~~~~~~
11121
11122 .. note::
11123
11124   The metadata is currently in development and is subject to major
11125   changes. Only the current version is supported. *When this document
11126   was generated the version was 2.6.*
11127
11128 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
11129 record (see :ref:`amdgpu-note-records-v3-v4`).
11130
11131 The metadata is represented as Message Pack formatted binary data (see
11132 [MsgPack]_). The top level is a Message Pack map that includes the keys
11133 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
11134 and referenced tables.
11135
11136 Additional information can be added to the maps. To avoid conflicts, any
11137 key names should be prefixed by "*vendor-name*." where ``vendor-name``
11138 can be the name of the vendor and specific vendor tool that generates the
11139 information. The prefix is abbreviated to simply "." when it appears
11140 within a map that has been added by the same *vendor-name*.
11141
11142   .. table:: AMDPAL Code Object Metadata Map
11143      :name: amdgpu-amdpal-code-object-metadata-map-table
11144
11145      =================== ============== ========= ======================================================================
11146      String Key          Value Type     Required? Description
11147      =================== ============== ========= ======================================================================
11148      "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
11149                          2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
11150      "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
11151                          map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
11152                                                   definition of the keys included in that map.
11153      =================== ============== ========= ======================================================================
11154
11155 ..
11156
11157   .. table:: AMDPAL Code Object Pipeline Metadata Map
11158      :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
11159
11160      ====================================== ============== ========= ===================================================
11161      String Key                             Value Type     Required? Description
11162      ====================================== ============== ========= ===================================================
11163      ".name"                                string                   Source name of the pipeline.
11164      ".type"                                string                   Pipeline type, e.g. VsPs. Values include:
11165
11166                                                                        - "VsPs"
11167                                                                        - "Gs"
11168                                                                        - "Cs"
11169                                                                        - "Ngg"
11170                                                                        - "Tess"
11171                                                                        - "GsTess"
11172                                                                        - "NggTess"
11173
11174      ".internal_pipeline_hash"              sequence of    Required  Internal compiler hash for this pipeline. Lower
11175                                             2 integers               64 bits is the "stable" portion of the hash, used
11176                                                                      for e.g. shader replacement lookup. Upper 64 bits
11177                                                                      is the "unique" portion of the hash, used for
11178                                                                      e.g. pipeline cache lookup. The value is
11179                                                                      implementation defined, and can not be relied on
11180                                                                      between different builds of the compiler.
11181      ".shaders"                             map                      Per-API shader metadata. See
11182                                                                      :ref:`amdgpu-amdpal-code-object-shader-map-table`
11183                                                                      for the definition of the keys included in that
11184                                                                      map.
11185      ".hardware_stages"                     map                      Per-hardware stage metadata. See
11186                                                                      :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
11187                                                                      for the definition of the keys included in that
11188                                                                      map.
11189      ".shader_functions"                    map                      Per-shader function metadata. See
11190                                                                      :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
11191                                                                      for the definition of the keys included in that
11192                                                                      map.
11193      ".registers"                           map            Required  Hardware register configuration. See
11194                                                                      :ref:`amdgpu-amdpal-code-object-register-map-table`
11195                                                                      for the definition of the keys included in that
11196                                                                      map.
11197      ".user_data_limit"                     integer                  Number of user data entries accessed by this
11198                                                                      pipeline.
11199      ".spill_threshold"                     integer                  The user data spill threshold.  0xFFFF for
11200                                                                      NoUserDataSpilling.
11201      ".uses_viewport_array_index"           boolean                  Indicates whether or not the pipeline uses the
11202                                                                      viewport array index feature. Pipelines which use
11203                                                                      this feature can render into all 16 viewports,
11204                                                                      whereas pipelines which do not use it are
11205                                                                      restricted to viewport #0.
11206      ".es_gs_lds_size"                      integer                  Size in bytes of LDS space used internally for
11207                                                                      handling data-passing between the ES and GS
11208                                                                      shader stages. This can be zero if the data is
11209                                                                      passed using off-chip buffers. This value should
11210                                                                      be used to program all user-SGPRs which have been
11211                                                                      marked with "UserDataMapping::EsGsLdsSize"
11212                                                                      (typically only the GS and VS HW stages will ever
11213                                                                      have a user-SGPR so marked).
11214      ".nggSubgroupSize"                     integer                  Explicit maximum subgroup size for NGG shaders
11215                                                                      (maximum number of threads in a subgroup).
11216      ".num_interpolants"                    integer                  Graphics only. Number of PS interpolants.
11217      ".mesh_scratch_memory_size"            integer                  Max mesh shader scratch memory used.
11218      ".api"                                 string                   Name of the client graphics API.
11219      ".api_create_info"                     binary                   Graphics API shader create info binary blob. Can
11220                                                                      be defined by the driver using the compiler if
11221                                                                      they want to be able to correlate API-specific
11222                                                                      information used during creation at a later time.
11223      ====================================== ============== ========= ===================================================
11224
11225 ..
11226
11227   .. table:: AMDPAL Code Object Shader Map
11228      :name: amdgpu-amdpal-code-object-shader-map-table
11229
11230
11231      +-------------+--------------+-------------------------------------------------------------------+
11232      |String Key   |Value Type    |Description                                                        |
11233      +=============+==============+===================================================================+
11234      |- ".compute" |map           |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
11235      |- ".vertex"  |              |for the definition of the keys included in that map.               |
11236      |- ".hull"    |              |                                                                   |
11237      |- ".domain"  |              |                                                                   |
11238      |- ".geometry"|              |                                                                   |
11239      |- ".pixel"   |              |                                                                   |
11240      +-------------+--------------+-------------------------------------------------------------------+
11241
11242 ..
11243
11244   .. table:: AMDPAL Code Object API Shader Metadata Map
11245      :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
11246
11247      ==================== ============== ========= =====================================================================
11248      String Key           Value Type     Required? Description
11249      ==================== ============== ========= =====================================================================
11250      ".api_shader_hash"   sequence of    Required  Input shader hash, typically passed in from the client. The value
11251                           2 integers               is implementation defined, and can not be relied on between
11252                                                    different builds of the compiler.
11253      ".hardware_mapping"  sequence of    Required  Flags indicating the HW stages this API shader maps to. Values
11254                           string                   include:
11255
11256                                                      - ".ls"
11257                                                      - ".hs"
11258                                                      - ".es"
11259                                                      - ".gs"
11260                                                      - ".vs"
11261                                                      - ".ps"
11262                                                      - ".cs"
11263
11264      ==================== ============== ========= =====================================================================
11265
11266 ..
11267
11268   .. table:: AMDPAL Code Object Hardware Stage Map
11269      :name: amdgpu-amdpal-code-object-hardware-stage-map-table
11270
11271      +-------------+--------------+-----------------------------------------------------------------------+
11272      |String Key   |Value Type    |Description                                                            |
11273      +=============+==============+=======================================================================+
11274      |- ".ls"      |map           |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
11275      |- ".hs"      |              |for the definition of the keys included in that map.                   |
11276      |- ".es"      |              |                                                                       |
11277      |- ".gs"      |              |                                                                       |
11278      |- ".vs"      |              |                                                                       |
11279      |- ".ps"      |              |                                                                       |
11280      |- ".cs"      |              |                                                                       |
11281      +-------------+--------------+-----------------------------------------------------------------------+
11282
11283 ..
11284
11285   .. table:: AMDPAL Code Object Hardware Stage Metadata Map
11286      :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
11287
11288      ========================== ============== ========= ===============================================================
11289      String Key                 Value Type     Required? Description
11290      ========================== ============== ========= ===============================================================
11291      ".entry_point"             string                   The ELF symbol pointing to this pipeline's stage entry point.
11292      ".scratch_memory_size"     integer                  Scratch memory size in bytes.
11293      ".lds_size"                integer                  Local Data Share size in bytes.
11294      ".perf_data_buffer_size"   integer                  Performance data buffer size in bytes.
11295      ".vgpr_count"              integer                  Number of VGPRs used.
11296      ".sgpr_count"              integer                  Number of SGPRs used.
11297      ".vgpr_limit"              integer                  If non-zero, indicates the shader was compiled with a
11298                                                          directive to instruct the compiler to limit the VGPR usage to
11299                                                          be less than or equal to the specified value (only set if
11300                                                          different from HW default).
11301      ".sgpr_limit"              integer                  SGPR count upper limit (only set if different from HW
11302                                                          default).
11303      ".threadgroup_dimensions"  sequence of              Thread-group X/Y/Z dimensions (Compute only).
11304                                 3 integers
11305      ".wavefront_size"          integer                  Wavefront size (only set if different from HW default).
11306      ".uses_uavs"               boolean                  The shader reads or writes UAVs.
11307      ".uses_rovs"               boolean                  The shader reads or writes ROVs.
11308      ".writes_uavs"             boolean                  The shader writes to one or more UAVs.
11309      ".writes_depth"            boolean                  The shader writes out a depth value.
11310      ".uses_append_consume"     boolean                  The shader uses append and/or consume operations, either
11311                                                          memory or GDS.
11312      ".uses_prim_id"            boolean                  The shader uses PrimID.
11313      ========================== ============== ========= ===============================================================
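
As an illustration of the table above, the following is a minimal sketch of how
a tool might check the value types of a hardware stage metadata map after it has
been decoded into a Python ``dict``. The function and variable names are
hypothetical and are not part of PAL or LLVM. The sketch only checks value
types; it does not check which keys are required.

.. code-block:: python

  # Keys and value types copied from the
  # "AMDPAL Code Object Hardware Stage Metadata Map" table.
  _INTEGER_KEYS = {
      ".scratch_memory_size", ".lds_size", ".perf_data_buffer_size",
      ".vgpr_count", ".sgpr_count", ".vgpr_limit", ".sgpr_limit",
      ".wavefront_size",
  }
  _BOOLEAN_KEYS = {
      ".uses_uavs", ".uses_rovs", ".writes_uavs", ".writes_depth",
      ".uses_append_consume", ".uses_prim_id",
  }


  def _is_int(value):
      # bool is a subclass of int in Python, so exclude it explicitly.
      return isinstance(value, int) and not isinstance(value, bool)


  def check_hardware_stage_metadata(metadata):
      """Return a list of problems found in a decoded metadata map."""
      problems = []
      for key, value in metadata.items():
          if key == ".entry_point":
              ok = isinstance(value, str)
          elif key == ".threadgroup_dimensions":
              ok = (isinstance(value, (list, tuple)) and len(value) == 3
                    and all(_is_int(v) for v in value))
          elif key in _INTEGER_KEYS:
              ok = _is_int(value)
          elif key in _BOOLEAN_KEYS:
              ok = isinstance(value, bool)
          else:
              problems.append(f"unknown key {key}")
              continue
          if not ok:
              problems.append(f"bad value type for {key}")
      return problems

An empty list means every key is known and has the value type given in the
table.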
11314
11315 ..
11316
11317   .. table:: AMDPAL Code Object Shader Function Map
11318      :name: amdgpu-amdpal-code-object-shader-function-map-table
11319
11320      =============== ============== ====================================================================
11321      String Key      Value Type     Description
11322      =============== ============== ====================================================================
11323      *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
11324                                     entry address. The value is the function's metadata. See
11325                                     :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
11326      =============== ============== ====================================================================
11327
11328 ..
11329
11330   .. table:: AMDPAL Code Object Shader Function Metadata Map
11331      :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
11332
11333      ============================= ============== =================================================================
11334      String Key                    Value Type     Description
11335      ============================= ============== =================================================================
11336      ".api_shader_hash"            sequence of    Input shader hash, typically passed in from the client. The value
11337                                    2 integers     is implementation defined, and can not be relied on between
11338                                                   different builds of the compiler.
11339      ".scratch_memory_size"        integer        Size in bytes of scratch memory used by the shader.
11340      ".lds_size"                   integer        Size in bytes of LDS memory.
11341      ".vgpr_count"                 integer        Number of VGPRs used by the shader.
11342      ".sgpr_count"                 integer        Number of SGPRs used by the shader.
11343      ".stack_frame_size_in_bytes"  integer        Stack frame size in bytes used by the shader.
11344      ".shader_subtype"             string         Shader subtype/kind. Values include:
11345
11346                                                     - "Unknown"
11347
11348      ============================= ============== =================================================================
11349
11350 ..
11351
11352   .. table:: AMDPAL Code Object Register Map
11353      :name: amdgpu-amdpal-code-object-register-map-table
11354
11355      ========================== ============== ====================================================================
11356      32-bit Integer Key         Value Type     Description
11357      ========================== ============== ====================================================================
11358      ``reg offset``             32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
11359                                                a GRBM register (i.e., driver accessible GPU register number, not
11360                                                shader GPR register number). The driver is required to program each
11361                                                specified register to the corresponding specified value when
11362                                                executing this pipeline. Typically, the ``reg offsets`` are the
11363                                                ``uint16_t`` offsets to each register as defined by the hardware
11364                                                chip headers. The register is set to the provided value. However, a
11365                                                ``reg offset`` that specifies a user data register (e.g.,
11366                                                COMPUTE_USER_DATA_0) needs special treatment. See
11367                                                :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
11368                                                information.
11369      ========================== ============== ====================================================================
11370
11371 .. _amdgpu-amdpal-code-object-user-data-section:
11372
11373 User Data
11374 +++++++++
11375
11376 Each hardware stage has a set of 32-bit physical SPI *user data registers*
11377 (either 16 or 32 based on graphics IP and the stage) which can be
11378 written from a command buffer and then loaded into SGPRs when waves are
11379 launched via a subsequent dispatch or draw operation. This is the way
11380 most arguments are passed from the application/runtime to a hardware
11381 shader.
11382
11383 PAL abstracts this functionality by exposing a set of 128 *user data
11384 entries* per pipeline that a client can use to pass arguments from a command
11385 buffer to one or more shaders in that pipeline. The ELF code object must
11386 specify a mapping from virtualized *user data entries* to physical *user
11387 data registers*, and PAL is responsible for implementing that mapping,
11388 including spilling overflow *user data entries* to memory if needed.
11389
11390 Since the *user data registers* are GRBM-accessible SPI registers, this
11391 mapping is actually embedded in the ``.registers`` metadata entry. For
11392 most registers, the value in that map is a literal 32-bit value that
11393 should be written to the register by the driver. However, when the
11394 register is a *user data register* (any USER_DATA register, e.g.,
11395 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
11396 the driver to write either a *user data entry* value or one of several
11397 driver-internal values to the register. This encoding is described in
11398 the following table:
11399
11400 .. note::
11401
11402   Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
11403   and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
11404   always be programmed to the address of the GlobalTable, and *user data
11405   register* 1 must always be programmed to the address of the PerShaderTable.
11406
11407 ..
11408
11409   .. table:: AMDPAL User Data Mapping
11410      :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
11411
11412      ==========  =================  ===============================================================================
11413      Value       Name               Description
11414      ==========  =================  ===============================================================================
11415      0..127      *User Data Entry*  32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
11416      0x10000000  GlobalTable        32-bit pointer to GPU memory containing the global internal table (should
11417                                     always point to *user data register* 0).
11418      0x10000001  PerShaderTable     32-bit pointer to GPU memory containing the per-shader internal table. See
11419                                     :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
11420                                     for more detail (should always point to *user data register* 1).
11421      0x10000002  SpillTable         32-bit pointer to GPU memory containing the user data spill table. See
11422                                     :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
11423                                     more detail.
11424      0x10000003  BaseVertex         Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
11425                                     reference the draw index in the vertex shader. Only supported by the first
11426                                     stage in a graphics pipeline.
11427      0x10000004  BaseInstance       Instance offset (32-bit unsigned integer). Only supported by the first stage in
11428                                     a graphics pipeline.
11429      0x10000005  DrawIndex          Draw index (32-bit unsigned integer). Only supported by the first stage in a
11430                                     graphics pipeline.
11431      0x10000006  Workgroup          Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
11432                                     a buffer containing the grid dimensions for a Compute dispatch operation. The
11433                                     high half of the address is stored in the next sequential user-SGPR. Only
11434                                     supported by compute pipelines.
11435      0x1000000A  EsGsLdsSize        Indicates that PAL will program this user-SGPR to contain the amount of LDS
11436                                     space used for the ES/GS pseudo-ring-buffer for passing data between shader
11437                                     stages.
11438      0x1000000B  ViewId             View ID (32-bit unsigned integer). Identifies a view of graphics
11439                                     pipeline instancing.
11440      0x1000000C  StreamOutTable     32-bit pointer to GPU memory containing the stream out target SRD table.  This
11441                                     can only appear for one shader stage per pipeline.
11442      0x1000000D  PerShaderPerfData  32-bit pointer to GPU memory containing the per-shader performance data buffer.
11443      0x1000000F  VertexBufferTable  32-bit pointer to GPU memory containing the vertex buffer SRD table.  This can
11444                                     only appear for one shader stage per pipeline.
11445      0x10000010  UavExportTable     32-bit pointer to GPU memory containing the UAV export SRD table.  This can
11446                                     only appear for one shader stage per pipeline (PS). These replace color targets
11447                                     and are completely separate from any UAVs used by the shader. This is optional,
11448                                     and only used by the PS when UAV exports are used to replace color-target
11449                                     exports to optimize specific shaders.
11450      0x10000011  NggCullingData     64-bit pointer to GPU memory containing the hardware register data needed by
11451                                     some NGG pipelines to perform culling.  This value contains the address of the
11452                                     first of two consecutive registers which provide the full GPU address.
11453      0x10000015  FetchShaderPtr     64-bit pointer to GPU memory containing the fetch shader subroutine.
11454      ==========  =================  ===============================================================================
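
The encoding in the table above can be illustrated with a small decoder. This
is a minimal sketch, not PAL's implementation: the function name and the
returned tuples are hypothetical, and only the numeric values and names are
taken from the table.

.. code-block:: python

  # Values and names copied from the "AMDPAL User Data Mapping" table.
  USER_DATA_ENTRY_COUNT = 128  # user data entries 0..127

  INTERNAL_VALUES = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }


  def decode_user_data_value(value):
      """Classify the value stored for a user data register."""
      if 0 <= value < USER_DATA_ENTRY_COUNT:
          # The register receives user_data_entry[value].
          return ("UserDataEntry", value)
      name = INTERNAL_VALUES.get(value)
      if name is not None:
          # The register receives a driver-internal value.
          return (name, None)
      # Not listed in the table.
      return ("Unknown", value)

For example, ``decode_user_data_value(5)`` selects *user data entry* 5, while
``decode_user_data_value(0x10000002)`` selects the spill table pointer.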
11455
11456 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
11457
11458 Per-Shader Table
11459 ################
11460
11461 The *PerShaderTable* value is the low 32 bits of the GPU address of an
11462 optional buffer in the ``.data`` section of the ELF. The high 32 bits of the
11463 address match the high 32 bits of the shader's program counter.
11464
11465 The buffer can be anything the shader compiler needs it for, and
11466 allows each shader to have its own region of the ``.data`` section.
11467 Typically, this could be a table of buffer SRDs and the data pointed to
11468 by the buffer SRDs, but it could be a flat-address region of memory as
11469 well. Its layout and usage are defined by the shader compiler.
11470
11471 Each shader's table in the ``.data`` section is referenced by the symbol
11472 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data``  where *xs* corresponds with the
11473 hardware shader stage the data is for. E.g.,
11474 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
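
The naming and addressing rules above can be sketched as follows. This is an
illustrative sketch only: both helper names are hypothetical, and the stage
argument is assumed to be the hardware stage abbreviation without a leading
dot (for example ``cs``).

.. code-block:: python

  def per_shader_table_symbol(stage):
      """Return the symbol naming a hardware stage's per-shader table."""
      return f"_amdgpu_{stage}_shdr_intrl_data"


  def per_shader_table_address(low32, program_counter):
      """Rebuild the 64-bit table address from its low 32 bits.

      The high 32 bits are taken from the shader's program counter, as
      described in the text above.
      """
      high = program_counter & 0xFFFFFFFF00000000
      return high | (low32 & 0xFFFFFFFF)

For instance, ``per_shader_table_symbol("cs")`` yields
``_amdgpu_cs_shdr_intrl_data``, matching the example in the text.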
11475
11476 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
11477
11478 Spill Table
11479 ###########
11480
11481 A hardware shader may need access to more *user data entries* than there
11482 are slots available in the *user data registers* for one or more hardware
11483 shader stages. In that case, the PAL runtime expects the necessary *user
11484 data entries* to be spilled to GPU memory and one *user data register* to
11485 be used to point to the spilled user data memory. The
11486 value of the *user data entry* must then represent the location where
11487 a shader expects to read the low 32 bits of the table's GPU virtual
11488 address. The *spill table* itself represents a set of 32-bit values
11489 managed by the PAL runtime in GPU-accessible memory that can be made
11490 indirectly accessible to a hardware shader.
11491
11492 Unspecified OS
11493 --------------
11494
11495 This section provides code conventions used when the target triple OS is
11496 empty (see :ref:`amdgpu-target-triples`).
11497
11498 Trap Handler ABI
11499 ~~~~~~~~~~~~~~~~
11500
11501 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
11502 runtime does not install a trap handler. The ``llvm.trap`` and
11503 ``llvm.debugtrap`` intrinsics are handled as follows:
11504
11505   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
11506      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
11507
11508      =============== =============== ===========================================
11509      Usage           Code Sequence   Description
11510      =============== =============== ===========================================
11511      llvm.trap       s_endpgm        Causes wavefront to be terminated.
11512      llvm.debugtrap  *none*          Compiler warning given that there is no
11513                                      trap handler installed.
11514      =============== =============== ===========================================
11515
11516 Source Languages
11517 ================
11518
11519 .. _amdgpu-opencl:
11520
11521 OpenCL
11522 ------
11523
11524 When the language is OpenCL the following differences occur:
11525
11526 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11527 2. The AMDGPU backend appends additional arguments to the kernel's explicit
11528    arguments for the AMDHSA OS (see
11529    :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
11530 3. Additional metadata is generated
11531    (see :ref:`amdgpu-amdhsa-code-object-metadata`).
11532
11533   .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
11534      :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
11535
11536      ======== ==== ========= ===========================================
11537      Position Byte Byte      Description
11538               Size Alignment
11539      ======== ==== ========= ===========================================
11540      1        8    8         OpenCL Global Offset X
11541      2        8    8         OpenCL Global Offset Y
11542      3        8    8         OpenCL Global Offset Z
11543      4        8    8         OpenCL address of printf buffer
11544      5        8    8         OpenCL address of virtual queue used by
11545                              enqueue_kernel.
11546      6        8    8         OpenCL address of AqlWrap struct used by
11547                              enqueue_kernel.
11548      7        8    8         Pointer argument used for Multi-grid
11549                              synchronization.
11550      ======== ==== ========= ===========================================
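
As a rough illustration of the table above, the sketch below computes byte
offsets for the implicit arguments. It assumes the implicit arguments are laid
out in table order, directly after the explicit arguments, with each one placed
at the next offset that satisfies its alignment. That layout assumption and all
names in the code are hypothetical; the actual kernel argument layout should be
taken from the code object metadata.

.. code-block:: python

  # (name, byte size, byte alignment) in table order. The names are
  # hypothetical labels for the rows of the table above.
  IMPLICIT_ARGS = [
      ("global_offset_x", 8, 8),
      ("global_offset_y", 8, 8),
      ("global_offset_z", 8, 8),
      ("printf_buffer", 8, 8),
      ("virtual_queue", 8, 8),
      ("aql_wrap", 8, 8),
      ("multigrid_sync_arg", 8, 8),
  ]


  def align_up(value, alignment):
      """Round value up to the next multiple of alignment."""
      return (value + alignment - 1) // alignment * alignment


  def implicit_arg_offsets(explicit_args_size):
      """Map each implicit argument name to its assumed byte offset."""
      offsets = {}
      offset = explicit_args_size
      for name, size, alignment in IMPLICIT_ARGS:
          offset = align_up(offset, alignment)
          offsets[name] = offset
          offset += size
      return offsets

Under these assumptions, a kernel with 12 bytes of explicit arguments would
place the first implicit argument at byte offset 16.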
11551
11552 .. _amdgpu-hcc:
11553
11554 HCC
11555 ---
11556
11557 When the language is HCC the following differences occur:
11558
11559 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11560
11561 .. _amdgpu-assembler:
11562
11563 Assembler
11564 ---------
11565
11566 The AMDGPU backend has an LLVM-MC based assembler, which is currently in
11567 development. It supports AMDGCN GFX6-GFX10.
11568
11569 This section describes general syntax for instructions and operands.
11570
11571 Instructions
11572 ~~~~~~~~~~~~
11573
11574 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
11575
11576   | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
11577     <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
11578
11579 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
11580 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
11581
11582 The order of operands and modifiers is fixed.
11583 Most modifiers are optional and may be omitted.
11584
11585 Links to detailed instruction syntax descriptions may be found in the following
11586 table. Note that features under development are not included
11587 in these descriptions.
11588
11589     =================================== =======================================
11590     Core ISA                            ISA Extensions
11591     =================================== =======================================
11592     :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
11593     :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
11594     :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
11595
11596                                         :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
11597
11598                                         :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
11599
11600                                         :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
11601
11602                                         :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
11603
11604                                         :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
11605
11606                                         :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
11607
11608     :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
11609
11610                                         :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
11611     =================================== =======================================
11612
11613 For more information about instructions, their semantics and supported
11614 combinations of operands, refer to one of the instruction set architecture
11615 manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
11616 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
11617 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.
11618
11619 Operands
11620 ~~~~~~~~
11621
11622 A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
11623
11624 Modifiers
11625 ~~~~~~~~~
11626
11627 A detailed description of modifiers may be found
11628 :doc:`here<AMDGPUModifierSyntax>`.
11629
11630 Instruction Examples
11631 ~~~~~~~~~~~~~~~~~~~~
11632
11633 DS
11634 ++
11635
11636 .. code-block:: nasm
11637
11638   ds_add_u32 v2, v4 offset:16
11639   ds_write_src2_b64 v2 offset0:4 offset1:8
11640   ds_cmpst_f32 v2, v4, v6
11641   ds_min_rtn_f64 v[8:9], v2, v[4:5]
11642
11643 For the full list of supported instructions, refer to "LDS/GDS instructions" in
11644 the ISA Manual.
11645
11646 FLAT
11647 ++++
11648
11649 .. code-block:: nasm
11650
11651   flat_load_dword v1, v[3:4]
11652   flat_store_dwordx3 v[3:4], v[5:7]
11653   flat_atomic_swap v1, v[3:4], v5 glc
11654   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
11655   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
11656
11657 For the full list of supported instructions, refer to "FLAT instructions" in
11658 the ISA Manual.
11659
11660 MUBUF
11661 +++++
11662
11663 .. code-block:: nasm
11664
11665   buffer_load_dword v1, off, s[4:7], s1
11666   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
11667   buffer_store_format_xy v[1:2], off, s[4:7], s1
11668   buffer_wbinvl1
11669   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
11670
11671 For the full list of supported instructions, refer to "MUBUF Instructions" in
11672 the ISA Manual.
11673
11674 SMRD/SMEM
11675 +++++++++
11676
11677 .. code-block:: nasm
11678
11679   s_load_dword s1, s[2:3], 0xfc
11680   s_load_dwordx8 s[8:15], s[2:3], s4
11681   s_load_dwordx16 s[88:103], s[2:3], s4
11682   s_dcache_inv_vol
11683   s_memtime s[4:5]
11684
11685 For the full list of supported instructions, refer to "Scalar Memory
11686 Operations" in the ISA Manual.
11687
11688 SOP1
11689 ++++
11690
11691 .. code-block:: nasm
11692
11693   s_mov_b32 s1, s2
11694   s_mov_b64 s[0:1], 0x80000000
11695   s_cmov_b32 s1, 200
11696   s_wqm_b64 s[2:3], s[4:5]
11697   s_bcnt0_i32_b64 s1, s[2:3]
11698   s_swappc_b64 s[2:3], s[4:5]
11699   s_cbranch_join s[4:5]
11700
11701 For the full list of supported instructions, refer to "SOP1 Instructions" in
11702 the ISA Manual.
11703
11704 SOP2
11705 ++++
11706
11707 .. code-block:: nasm
11708
11709   s_add_u32 s1, s2, s3
11710   s_and_b64 s[2:3], s[4:5], s[6:7]
11711   s_cselect_b32 s1, s2, s3
11712   s_andn2_b32 s2, s4, s6
11713   s_lshr_b64 s[2:3], s[4:5], s6
11714   s_ashr_i32 s2, s4, s6
11715   s_bfm_b64 s[2:3], s4, s6
11716   s_bfe_i64 s[2:3], s[4:5], s6
11717   s_cbranch_g_fork s[4:5], s[6:7]
11718
11719 For the full list of supported instructions, refer to "SOP2 Instructions" in
11720 the ISA Manual.
11721
11722 SOPC
11723 ++++
11724
11725 .. code-block:: nasm
11726
11727   s_cmp_eq_i32 s1, s2
11728   s_bitcmp1_b32 s1, s2
11729   s_bitcmp0_b64 s[2:3], s4
11730   s_setvskip s3, s5
11731
11732 For the full list of supported instructions, refer to "SOPC Instructions" in
11733 the ISA Manual.
11734
11735 SOPP
11736 ++++
11737
11738 .. code-block:: nasm
11739
11740   s_barrier
11741   s_nop 2
11742   s_endpgm
11743   s_waitcnt 0 ; Wait for all counters to be 0
11744   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
11745   s_waitcnt vmcnt(1) ; Wait until the vmcnt counter is at most 1.
11746   s_sethalt 9
11747   s_sleep 10
11748   s_sendmsg 0x1
11749   s_sendmsg sendmsg(MSG_INTERRUPT)
11750   s_trap 1
11751
11752 For the full list of supported instructions, refer to "SOPP Instructions" in
11753 the ISA Manual.
11754
11755 Unless otherwise mentioned, little verification is performed on the operands
11756 of SOPP instructions, so it is up to the programmer to be familiar with the
11757 range of acceptable values.
11758
11759 VALU
11760 ++++
11761
11762 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
11763 the assembler automatically uses the optimal encoding based on the operands. To
11764 force a specific encoding, add a suffix to the opcode of the instruction:
11765
11766 * _e32 for 32-bit VOP1/VOP2/VOPC
11767 * _e64 for 64-bit VOP3
11768 * _dpp for VOP_DPP
11769 * _sdwa for VOP_SDWA
11770
11771 VOP1/VOP2/VOP3/VOPC examples:
11772
11773 .. code-block:: nasm
11774
11775   v_mov_b32 v1, v2
11776   v_mov_b32_e32 v1, v2
11777   v_nop
11778   v_cvt_f64_i32_e32 v[1:2], v2
11779   v_floor_f32_e32 v1, v2
11780   v_bfrev_b32_e32 v1, v2
11781   v_add_f32_e32 v1, v2, v3
11782   v_mul_i32_i24_e64 v1, v2, 3
11783   v_mul_i32_i24_e32 v1, -3, v3
11784   v_mul_i32_i24_e32 v1, -100, v3
11785   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
11786   v_max_f16_e32 v1, v2, v3
11787
11788 VOP_DPP examples:
11789
11790 .. code-block:: nasm
11791
11792   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
11793   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11794   v_mov_b32 v0, v0 wave_shl:1
11795   v_mov_b32 v0, v0 row_mirror
11796   v_mov_b32 v0, v0 row_bcast:31
11797   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
11798   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11799   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11800
11801 VOP_SDWA examples:
11802
11803 .. code-block:: nasm
11804
11805   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
11806   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
11807   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
11808   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
11809   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
11810
11811 For the full list, refer to "Vector ALU instructions" in the ISA Manual.
11812
11813 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
11814
11815 Code Object V2 Predefined Symbols
11816 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11817
11818 .. warning::
11819   Code object V2 is not the default code object version emitted by
11820   this version of LLVM.
11821
11822 The AMDGPU assembler defines and updates some symbols automatically. These
11823 symbols do not affect code generation.
11824
11825 .option.machine_version_major
11826 +++++++++++++++++++++++++++++
11827
11828 Set to the GFX major generation number of the target being assembled for. For
11829 example, when assembling for a "GFX9" target this will be set to the integer
11830 value "9". The possible GFX major generation numbers are presented in
11831 :ref:`amdgpu-processors`.
11832
11833 .option.machine_version_minor
11834 +++++++++++++++++++++++++++++
11835
11836 Set to the GFX minor generation number of the target being assembled for. For
11837 example, when assembling for a "GFX810" target this will be set to the integer
11838 value "1". The possible GFX minor generation numbers are presented in
11839 :ref:`amdgpu-processors`.
11840
11841 .option.machine_version_stepping
11842 ++++++++++++++++++++++++++++++++
11843
11844 Set to the GFX stepping generation number of the target being assembled for.
11845 For example, when assembling for a "GFX704" target this will be set to the
11846 integer value "4". The possible GFX stepping generation numbers are presented
11847 in :ref:`amdgpu-processors`.
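
The three examples above suggest how a processor name splits into its version
numbers. The sketch below assumes the name has the form ``gfx`` followed by the
decimal major number, one hexadecimal minor digit and one hexadecimal stepping
digit. That naming assumption and the function name are hypothetical; the
authoritative values are listed in :ref:`amdgpu-processors`.

.. code-block:: python

  def split_gfx_version(processor):
      """Split a name such as "gfx810" into (major, minor, stepping)."""
      name = processor.lower()
      if not name.startswith("gfx") or len(name) < 6:
          raise ValueError(f"unexpected processor name: {processor}")
      digits = name[3:]
      major = int(digits[:-2])         # all but the last two characters
      minor = int(digits[-2], 16)      # second-to-last character
      stepping = int(digits[-1], 16)   # last character
      return major, minor, stepping

With this sketch, ``split_gfx_version("gfx810")`` gives a minor number of 1 and
``split_gfx_version("gfx704")`` gives a stepping number of 4, in line with the
examples in the text.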
11848
11849 .kernel.vgpr_count
11850 ++++++++++++++++++
11851
11852 Set to zero each time a
11853 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11854 encountered. At each instruction, if the current value of this symbol is less
11855 than or equal to the maximum VGPR number explicitly referenced within that
11856 instruction then the symbol value is updated to equal that VGPR number plus
11857 one.
11858
11859 .kernel.sgpr_count
11860 ++++++++++++++++++
11861
11862 Set to zero each time a
11863 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11864 encountered. At each instruction, if the current value of this symbol is less
11865 than or equal to the maximum SGPR number explicitly referenced within that
11866 instruction then the symbol value is updated to equal that SGPR number plus
11867 one.
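
The update rule shared by ``.kernel.vgpr_count`` and ``.kernel.sgpr_count`` can
be sketched as follows. This is a model of the rule as described in the text,
not the assembler's implementation; the function name is hypothetical, and each
instruction is represented only by the register numbers it references
explicitly.

.. code-block:: python

  def update_register_count(current, referenced_registers):
      """Apply the per-instruction update rule to a register count symbol.

      referenced_registers holds the register numbers of one register
      kind (VGPR or SGPR) that the instruction references explicitly.
      """
      if not referenced_registers:
          return current
      highest = max(referenced_registers)
      if current <= highest:
          return highest + 1
      return current


  # The symbol is reset to zero at each kernel directive, then updated
  # once per instruction.
  count = 0
  for registers in ([1, 2], [0], [7]):
      count = update_register_count(count, registers)

After the three modeled instructions, ``count`` is 8, because the highest
register number referenced is 7.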
11868
11869 .. _amdgpu-amdhsa-assembler-directives-v2:
11870
11871 Code Object V2 Directives
11872 ~~~~~~~~~~~~~~~~~~~~~~~~~
11873
11874 .. warning::
11875   Code object V2 is not the default code object version emitted by
11876   this version of LLVM.
11877
11878 The AMDGPU ABI defines auxiliary data in the output code object. In assembly
11879 source, this data can be specified with assembler directives.
11880
11881 .hsa_code_object_version major, minor
11882 +++++++++++++++++++++++++++++++++++++
11883
11884 *major* and *minor* are integers that specify the version of the HSA code
11885 object that will be generated by the assembler.
11886
11887 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
11888 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
11889
11890
11891 *major*, *minor*, and *stepping* are all integers that describe the instruction
11892 set architecture (ISA) version of the assembly program.
11893
11894 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
11895 "AMD" and *arch* should always be equal to "AMDGPU".
11896
11897 By default, the assembler derives the ISA version, *vendor*, and *arch*
11898 from the value of the ``-mcpu`` option that is passed to the assembler.
11899
11900 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
11901
11902 .amdgpu_hsa_kernel (name)
11903 +++++++++++++++++++++++++
11904
11905 This directive specifies that the symbol with the given name is a kernel entry
11906 point (label) and that the object should contain a corresponding symbol of type
11907 ``STT_AMDGPU_HSA_KERNEL``.
11908
11909 .amd_kernel_code_t
11910 ++++++++++++++++++
11911
11912 This directive marks the beginning of a list of key / value pairs that are used
11913 to specify the amd_kernel_code_t object that will be emitted by the assembler.
11914 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
11915 amd_kernel_code_t values that are unspecified a default value will be used. The
11916 default value for all keys is 0, with the following exceptions:
11917
11918 - *amd_code_version_major* defaults to 1.
11919 - *amd_code_version_minor* defaults to 2.
11920 - *amd_machine_kind* defaults to 1.
11921 - *amd_machine_version_major*, *amd_machine_version_minor*, and
11922   *amd_machine_version_stepping* are derived from the value of the ``-mcpu``
11923   option that is passed to the assembler.
11924 - *kernel_code_entry_byte_offset* defaults to 256.
11925 - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
11926   onwards, it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
11927   otherwise 5. Note that wavefront size is specified as a power of two, so a
11928   value of **n** means a size of 2^ **n**.
11929 - *call_convention* defaults to -1.
11930 - *kernarg_segment_alignment*, *group_segment_alignment*, and
11931   *private_segment_alignment* default to 4. Note that alignments are specified
11932   as a power of 2, so a value of **n** means an alignment of 2^ **n**.
11933 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
11934   GFX90A onwards.
11935 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
11936   GFX10 onwards.
11937 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
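
The power-of-two encoding used by *wavefront_size* and the alignment keys can
be illustrated with a minimal Python sketch. The helper names ``decode_log2``
and ``default_wavefront_size_log2`` are hypothetical and are not part of the
assembler; the sketch only restates the defaults listed above.

```python
def decode_log2(encoded):
    """Convert a value stored as a power-of-two exponent into the real size."""
    if encoded < 0:
        raise ValueError("exponent must be non-negative")
    return 1 << encoded


def default_wavefront_size_log2(gfx_major, wavefrontsize64):
    """Return the default *wavefront_size* key value for a target.

    gfx_major:       GFX major generation number (for example 9 or 10).
    wavefrontsize64: whether the ``wavefrontsize64`` target feature is enabled
                     (only consulted for GFX10 onwards).
    """
    if gfx_major < 10:
        return 6
    return 6 if wavefrontsize64 else 5


print(decode_log2(default_wavefront_size_log2(9, False)))   # 64
print(decode_log2(default_wavefront_size_log2(10, False)))  # 32
print(decode_log2(4))                                       # 16
```

The last line shows that the default alignment key value of 4 corresponds to an
alignment of 16.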
11938
11939 The *.amd_kernel_code_t* directive must be placed immediately after the
11940 function label and before any instructions.
11941
11942 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the
11943 comments in ``lib/Target/AMDGPU/AmdKernelCodeT.h``, and ``test/CodeGen/AMDGPU/hsa.s``.
11944
11945 .. _amdgpu-amdhsa-assembler-example-v2:
11946
11947 Code Object V2 Example Source Code
11948 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11949
11950 .. warning::
11951   Code Object V2 is not the default code object version emitted by
11952   this version of LLVM.
11953
11954 Here is an example of a minimal assembly source file, defining one HSA kernel:
11955
11956 .. code::
11957    :number-lines:
11958
11959    .hsa_code_object_version 1,0
11960    .hsa_code_object_isa
11961
11962    .hsatext
11963    .globl  hello_world
11964    .p2align 8
11965    .amdgpu_hsa_kernel hello_world
11966
11967    hello_world:
11968
11969      .amd_kernel_code_t
11970        enable_sgpr_kernarg_segment_ptr = 1
11971        is_ptr64 = 1
11972        compute_pgm_rsrc1_vgprs = 0
11973        compute_pgm_rsrc1_sgprs = 0
11974        compute_pgm_rsrc2_user_sgpr = 2
11975        compute_pgm_rsrc1_wgp_mode = 0
11976        compute_pgm_rsrc1_mem_ordered = 0
11977        compute_pgm_rsrc1_fwd_progress = 1
11978      .end_amd_kernel_code_t
11979
11980      s_load_dwordx2 s[0:1], s[0:1], 0x0
11981      v_mov_b32 v0, 3.14159
11982      s_waitcnt lgkmcnt(0)
11983      v_mov_b32 v1, s0
11984      v_mov_b32 v2, s1
11985      flat_store_dword v[1:2], v0
11986      s_endpgm
11987    .Lfunc_end0:
11988      .size   hello_world, .Lfunc_end0-hello_world
11989
11990 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:
11991
11992 Code Object V3 to V4 Predefined Symbols
11993 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11994
11995 The AMDGPU assembler defines and updates some symbols automatically. These
11996 symbols do not affect code generation.
11997
11998 .amdgcn.gfx_generation_number
11999 +++++++++++++++++++++++++++++
12000
12001 Set to the GFX major generation number of the target being assembled for. For
12002 example, when assembling for a "GFX9" target this will be set to the integer
12003 value "9". The possible GFX major generation numbers are presented in
12004 :ref:`amdgpu-processors`.
12005
12006 .amdgcn.gfx_generation_minor
12007 ++++++++++++++++++++++++++++
12008
12009 Set to the GFX minor generation number of the target being assembled for. For
12010 example, when assembling for a "GFX810" target this will be set to the integer
12011 value "1". The possible GFX minor generation numbers are presented in
12012 :ref:`amdgpu-processors`.
12013
12014 .amdgcn.gfx_generation_stepping
12015 +++++++++++++++++++++++++++++++
12016
12017 Set to the GFX stepping generation number of the target being assembled for.
12018 For example, when assembling for a "GFX704" target this will be set to the
12019 integer value "4". The possible GFX stepping generation numbers are presented
12020 in :ref:`amdgpu-processors`.
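
As an illustration of how the three symbols relate to a processor name, the
following Python sketch splits a name such as ``gfx810`` into its major, minor,
and stepping numbers. The function ``split_gfx_name`` is hypothetical, and the
parsing rule is an assumption: the last character is taken as the stepping
(read as a hexadecimal digit), the character before it as the minor number, and
the remaining leading digits as the major number. The authoritative values are
those listed in :ref:`amdgpu-processors`.

```python
def split_gfx_name(processor):
    """Split a processor name such as "gfx810" into (major, minor, stepping).

    Assumed rule: the final character is the stepping (hexadecimal, so "a"
    means 10), the character before it is the minor number, and the leading
    digits form the major number.
    """
    name = processor.lower()
    if name.startswith("gfx"):
        name = name[3:]
    if len(name) < 3:
        raise ValueError("expected at least three characters after 'gfx'")
    major = int(name[:-2])
    minor = int(name[-2])
    stepping = int(name[-1], 16)
    return major, minor, stepping


print(split_gfx_name("gfx810"))  # (8, 1, 0)
print(split_gfx_name("gfx704"))  # (7, 0, 4)
print(split_gfx_name("gfx90a"))  # (9, 0, 10)
```

The first two results agree with the "GFX810" and "GFX704" examples given for
the minor and stepping symbols above.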
12021
12022 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
12023
12024 .amdgcn.next_free_vgpr
12025 ++++++++++++++++++++++
12026
12027 Set to zero before assembly begins. At each instruction, if the current value
12028 of this symbol is less than or equal to the maximum VGPR number explicitly
12029 referenced within that instruction then the symbol value is updated to equal
12030 that VGPR number plus one.
12031
12032 May be used to set the ``.amdhsa_next_free_vgpr`` directive in
12033 :ref:`amdhsa-kernel-directives-table`.
12034
12035 May be set at any time, e.g. manually set to zero at the start of each kernel.
12036
12037 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
12038
12039 .amdgcn.next_free_sgpr
12040 ++++++++++++++++++++++
12041
12042 Set to zero before assembly begins. At each instruction, if the current value
12043 of this symbol is less than or equal to the maximum SGPR number explicitly
12044 referenced within that instruction then the symbol value is updated to equal
12045 that SGPR number plus one.
12046
12047 May be used to set the ``.amdhsa_next_free_sgpr`` directive in
12048 :ref:`amdhsa-kernel-directives-table`.
12049
12050 May be set at any time, e.g. manually set to zero at the start of each kernel.
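
The update rule shared by ``.amdgcn.next_free_vgpr`` and
``.amdgcn.next_free_sgpr`` can be modelled with a short Python sketch. This is
only a model of the rule as described above, not the assembler's
implementation; the function ``next_free_after`` is hypothetical, and the
register lists were transcribed by hand from the ``hello_world`` kernel in
:ref:`amdgpu-amdhsa-assembler-example-v3-v4`.

```python
def next_free_after(instructions, start=0):
    """Return the symbol value after applying the update rule to each instruction.

    `instructions` has one entry per instruction; each entry lists the
    register numbers that instruction references explicitly. A range such as
    v[1:2] contributes both 1 and 2.
    """
    value = start
    for referenced in instructions:
        # Update only when the current value does not already cover the
        # highest register number named by this instruction.
        if referenced and value <= max(referenced):
            value = max(referenced) + 1
    return value


# One entry per instruction of hello_world, in program order.
HELLO_WORLD_VGPRS = [[], [0], [], [1], [2], [0, 1, 2], []]
HELLO_WORLD_SGPRS = [[0, 1], [], [], [0], [1], [], []]

print(next_free_after(HELLO_WORLD_VGPRS))  # 3
print(next_free_after(HELLO_WORLD_SGPRS))  # 2
```

The two results match the ``.vgpr_count`` and ``.sgpr_count`` values given in
the metadata of that example.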
12051
12052 .. _amdgpu-amdhsa-assembler-directives-v3-v4:
12053
12054 Code Object V3 to V4 Directives
12055 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12056
12057 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
12058 architecture processors, and are not OS-specific. Directives which begin with
12059 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
12060 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
12061 :ref:`amdgpu-processors`.
12062
12063 .. _amdgpu-assembler-directive-amdgcn-target:
12064
12065 .amdgcn_target <target-triple> "-" <target-id>
12066 ++++++++++++++++++++++++++++++++++++++++++++++
12067
12068 Optional directive which declares the ``<target-triple>-<target-id>`` supported
12069 by the containing assembler source file. Used by the assembler to validate
12070 command-line options such as ``-triple``, ``-mcpu``, and
12071 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
12072 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
12073
12074 .. note::
12075
12076   The target ID syntax used for code object V2 to V3 for this directive differs
12077   from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
12078
12079 .amdhsa_kernel <name>
12080 +++++++++++++++++++++
12081
12082 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
12083 ``<name>.kd``, in the current location of the current section. Only valid when
12084 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
12085 instruction to execute, and does not need to be previously defined.
12086
12087 Marks the beginning of a list of directives used to generate the bytes of a
12088 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
12089 Directives which may appear in this list are described in
12090 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
12091 be valid for the target being assembled for, and cannot be repeated. Directives
12092 support the range of values specified by the field they reference in
12093 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
12094 assumed to have its default value, unless it is marked as "Required", in which
12095 case it is an error to omit the directive. This list of directives is
12096 terminated by an ``.end_amdhsa_kernel`` directive.
12097
12098   .. table:: AMDHSA Kernel Assembler Directives
12099      :name: amdhsa-kernel-directives-table
12100
12101      ======================================================== =================== ============ ===================
12102      Directive                                                Default             Supported On Description
12103      ======================================================== =================== ============ ===================
12104      ``.amdhsa_group_segment_fixed_size``                     0                   GFX6-GFX10   Controls GROUP_SEGMENT_FIXED_SIZE in
12105                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12106      ``.amdhsa_private_segment_fixed_size``                   0                   GFX6-GFX10   Controls PRIVATE_SEGMENT_FIXED_SIZE in
12107                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12108      ``.amdhsa_kernarg_size``                                 0                   GFX6-GFX10   Controls KERNARG_SIZE in
12109                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12110      ``.amdhsa_user_sgpr_private_segment_buffer``             0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
12111                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12112      ``.amdhsa_user_sgpr_dispatch_ptr``                       0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_PTR in
12113                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12114      ``.amdhsa_user_sgpr_queue_ptr``                          0                   GFX6-GFX10   Controls ENABLE_SGPR_QUEUE_PTR in
12115                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12116      ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                   GFX6-GFX10   Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
12117                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12118      ``.amdhsa_user_sgpr_dispatch_id``                        0                   GFX6-GFX10   Controls ENABLE_SGPR_DISPATCH_ID in
12119                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12120      ``.amdhsa_user_sgpr_flat_scratch_init``                  0                   GFX6-GFX10   Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
12121                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12122      ``.amdhsa_user_sgpr_private_segment_size``               0                   GFX6-GFX10   Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
12123                                                                                                :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12124      ``.amdhsa_wavefront_size32``                             Target              GFX10        Controls ENABLE_WAVEFRONT_SIZE32 in
12125                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12126                                                               Specific
12127                                                               (wavefrontsize64)
12128      ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                   GFX6-GFX10   Controls ENABLE_PRIVATE_SEGMENT in
12129                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12130      ``.amdhsa_system_sgpr_workgroup_id_x``                   1                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_X in
12131                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12132      ``.amdhsa_system_sgpr_workgroup_id_y``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Y in
12133                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12134      ``.amdhsa_system_sgpr_workgroup_id_z``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_ID_Z in
12135                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12136      ``.amdhsa_system_sgpr_workgroup_info``                   0                   GFX6-GFX10   Controls ENABLE_SGPR_WORKGROUP_INFO in
12137                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12138      ``.amdhsa_system_vgpr_workitem_id``                      0                   GFX6-GFX10   Controls ENABLE_VGPR_WORKITEM_ID in
12139                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12140                                                                                                Possible values are defined in
12141                                                                                                :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
12142      ``.amdhsa_next_free_vgpr``                               Required            GFX6-GFX10   Maximum VGPR number explicitly referenced, plus one.
12143                                                                                                Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
12144                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12145      ``.amdhsa_next_free_sgpr``                               Required            GFX6-GFX10   Maximum SGPR number explicitly referenced, plus one.
12146                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12147                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12148      ``.amdhsa_accum_offset``                                 Required            GFX90A       Offset of the first AccVGPR in the unified register file.
12149                                                                                                Used to calculate ACCUM_OFFSET in
12150                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12151      ``.amdhsa_reserve_vcc``                                  1                   GFX6-GFX10   Whether the kernel may use the special VCC SGPR.
12152                                                                                                Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12153                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12154      ``.amdhsa_reserve_flat_scratch``                         1                   GFX7-GFX10   Whether the kernel may use flat instructions to access
12155                                                                                                scratch memory. Used to calculate
12156                                                                                                GRANULATED_WAVEFRONT_SGPR_COUNT in
12157                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12158      ``.amdhsa_reserve_xnack_mask``                           Target              GFX8-GFX10   Whether the kernel may trigger XNACK replay.
12159                                                               Feature                          Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12160                                                               Specific                         :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12161                                                               (xnack)
12162      ``.amdhsa_float_round_mode_32``                          0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_32 in
12163                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12164                                                                                                Possible values are defined in
12165                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12166      ``.amdhsa_float_round_mode_16_64``                       0                   GFX6-GFX10   Controls FLOAT_ROUND_MODE_16_64 in
12167                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12168                                                                                                Possible values are defined in
12169                                                                                                :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12170      ``.amdhsa_float_denorm_mode_32``                         0                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_32 in
12171                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12172                                                                                                Possible values are defined in
12173                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12174      ``.amdhsa_float_denorm_mode_16_64``                      3                   GFX6-GFX10   Controls FLOAT_DENORM_MODE_16_64 in
12175                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12176                                                                                                Possible values are defined in
12177                                                                                                :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12178      ``.amdhsa_dx10_clamp``                                   1                   GFX6-GFX10   Controls ENABLE_DX10_CLAMP in
12179                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12180      ``.amdhsa_ieee_mode``                                    1                   GFX6-GFX10   Controls ENABLE_IEEE_MODE in
12181                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12182      ``.amdhsa_fp16_overflow``                                0                   GFX9-GFX10   Controls FP16_OVFL in
12183                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12184      ``.amdhsa_tg_split``                                     Target              GFX90A       Controls TG_SPLIT in
12185                                                               Feature                          :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12186                                                               Specific
12187                                                               (tgsplit)
12188      ``.amdhsa_workgroup_processor_mode``                     Target              GFX10        Controls ENABLE_WGP_MODE in
12189                                                               Feature                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12190                                                               Specific
12191                                                               (cumode)
12192      ``.amdhsa_memory_ordered``                               1                   GFX10        Controls MEM_ORDERED in
12193                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12194      ``.amdhsa_forward_progress``                             0                   GFX10        Controls FWD_PROGRESS in
12195                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12196      ``.amdhsa_exception_fp_ieee_invalid_op``                 0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
12197                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12198      ``.amdhsa_exception_fp_denorm_src``                      0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
12199                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12200      ``.amdhsa_exception_fp_ieee_div_zero``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
12201                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12202      ``.amdhsa_exception_fp_ieee_overflow``                   0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
12203                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12204      ``.amdhsa_exception_fp_ieee_underflow``                  0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
12205                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12206      ``.amdhsa_exception_fp_ieee_inexact``                    0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
12207                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12208      ``.amdhsa_exception_int_div_zero``                       0                   GFX6-GFX10   Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
12209                                                                                                :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12210      ======================================================== =================== ============ ===================
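
The rules that directives cannot be repeated and that "Required" directives
must be present can be sketched in Python as follows. This is an illustrative
check only: the function ``check_kernel_directives`` and its error strings are
hypothetical and do not reproduce the assembler's diagnostics. Range and target
validity checks are omitted.

```python
# Directives marked "Required" for all supported targets in the table above.
REQUIRED = frozenset({".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"})


def check_kernel_directives(directives, required=REQUIRED):
    """Return a list of problems found in a sequence of directive names.

    `directives` lists the names appearing between `.amdhsa_kernel` and
    `.end_amdhsa_kernel`, in source order (order itself is not significant).
    """
    problems = []
    seen = set()
    for name in directives:
        if name in seen:
            problems.append(name + " is repeated")
        seen.add(name)
    for name in sorted(set(required) - seen):
        problems.append(name + " is required but missing")
    return problems


hello_world = [
    ".amdhsa_user_sgpr_kernarg_segment_ptr",
    ".amdhsa_next_free_vgpr",
    ".amdhsa_next_free_sgpr",
]
print(check_kernel_directives(hello_world))      # []
print(check_kernel_directives(hello_world[:2]))
# ['.amdhsa_next_free_sgpr is required but missing']
```

For a GFX90A target, the caller would pass
``required=REQUIRED | {".amdhsa_accum_offset"}``, since the table also marks
that directive as required there.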
12211
12212 .amdgpu_metadata
12213 ++++++++++++++++
12214
12215 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
12216 note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).
12217
12218 The contents must be in the [YAML]_ markup format, with the same structure and
12219 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
12220 :ref:`amdgpu-amdhsa-code-object-metadata-v4`.
12221
12222 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
12223
12224 .. _amdgpu-amdhsa-assembler-example-v3-v4:
12225
12226 Code Object V3 to V4 Example Source Code
12227 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12228
12229 Here is an example of a minimal assembly source file, defining one HSA kernel:
12230
12231 .. code::
12232    :number-lines:
12233
12234    .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12235
12236    .text
12237    .globl hello_world
12238    .p2align 8
12239    .type hello_world,@function
12240    hello_world:
12241      s_load_dwordx2 s[0:1], s[0:1], 0x0
12242      v_mov_b32 v0, 3.14159
12243      s_waitcnt lgkmcnt(0)
12244      v_mov_b32 v1, s0
12245      v_mov_b32 v2, s1
12246      flat_store_dword v[1:2], v0
12247      s_endpgm
12248    .Lfunc_end0:
12249      .size   hello_world, .Lfunc_end0-hello_world
12250
12251    .rodata
12252    .p2align 6
12253    .amdhsa_kernel hello_world
12254      .amdhsa_user_sgpr_kernarg_segment_ptr 1
12255      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12256      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12257    .end_amdhsa_kernel
12258
12259    .amdgpu_metadata
12260    ---
12261    amdhsa.version:
12262      - 1
12263      - 0
12264    amdhsa.kernels:
12265      - .name: hello_world
12266        .symbol: hello_world.kd
12267        .kernarg_segment_size: 48
12268        .group_segment_fixed_size: 0
12269        .private_segment_fixed_size: 0
12270        .kernarg_segment_align: 4
12271        .wavefront_size: 64
12272        .sgpr_count: 2
12273        .vgpr_count: 3
12274        .max_flat_workgroup_size: 256
12275        .args:
12276          - .size: 8
12277            .offset: 0
12278            .value_kind: global_buffer
12279            .address_space: global
12280            .actual_access: write_only
12281    //...
12282    .end_amdgpu_metadata
12283
12284 This kernel is equivalent to the following HIP program:
12285
12286 .. code::
12287    :number-lines:
12288
12289    __global__ void hello_world(float *p) {
12290        *p = 3.14159f;
12291    }
12292
12293 If an assembly source file contains multiple kernels and/or functions, the
12294 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
12295 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
12296 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
12297 kernels, where ``func1`` is only called from ``kern1``, it is sufficient
12298 to group the function with the kernel that calls it and reset the symbols
12299 between the two connected components:
12300
12301 .. code::
12302    :number-lines:
12303
12304    .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12305
12306    // gpr tracking symbols are implicitly set to zero
12307
12308    .text
12309    .globl kern0
12310    .p2align 8
12311    .type kern0,@function
12312    kern0:
12313      // ...
12314      s_endpgm
12315    .Lkern0_end:
12316      .size   kern0, .Lkern0_end-kern0
12317
12318    .rodata
12319    .p2align 6
12320    .amdhsa_kernel kern0
12321      // ...
12322      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12323      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12324    .end_amdhsa_kernel
12325
12326    // reset symbols to begin tracking usage in func1 and kern1
12327    .set .amdgcn.next_free_vgpr, 0
12328    .set .amdgcn.next_free_sgpr, 0
12329
12330    .text
12331    .hidden func1
12332    .global func1
12333    .p2align 2
12334    .type func1,@function
12335    func1:
12336      // ...
12337      s_setpc_b64 s[30:31]
12338    .Lfunc1_end:
12339      .size   func1, .Lfunc1_end-func1
12340
12341    .globl kern1
12342    .p2align 8
12343    .type kern1,@function
12344    kern1:
12345      // ...
12346      s_getpc_b64 s[4:5]
12347      s_add_u32 s4, s4, func1@rel32@lo+4
12348      s_addc_u32 s5, s5, func1@rel32@hi+12
12349      s_swappc_b64 s[30:31], s[4:5]
12350      // ...
12351      s_endpgm
12352    .Lkern1_end:
12353      .size   kern1, .Lkern1_end-kern1
12354
12355    .rodata
12356    .p2align 6
12357    .amdhsa_kernel kern1
12358      // ...
12359      .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12360      .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12361    .end_amdhsa_kernel
12362
12363 These symbols cannot identify connected components in order to automatically
12364 track the usage for each kernel. However, in some cases careful organization of
12365 the kernels and functions in the source file means there is minimal additional
12366 effort required to accurately calculate GPR usage.
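
The effect of resetting the symbols between connected components can be
modelled with a small Python sketch. The event stream, the register numbers,
and the function ``assemble`` are all made up for illustration; only the VGPR
symbol is modelled, and the SGPR symbol behaves the same way.

```python
def assemble(events):
    """Replay a simplified event stream and record the VGPR count per kernel.

    Each event is one of:
      ("inst", n)       an instruction whose highest explicit VGPR is v<n>
      ("set", value)    `.set .amdgcn.next_free_vgpr, value`
      ("kernel", name)  a kernel descriptor that reads the symbol's value
    """
    next_free_vgpr = 0  # implicitly zero before assembly begins
    recorded = {}
    for kind, arg in events:
        if kind == "inst":
            if next_free_vgpr <= arg:
                next_free_vgpr = arg + 1
        elif kind == "set":
            next_free_vgpr = arg
        elif kind == "kernel":
            recorded[arg] = next_free_vgpr
        else:
            raise ValueError("unknown event kind: " + repr(kind))
    return recorded


with_reset = [
    ("inst", 9), ("kernel", "kern0"),  # kern0 uses up to v9
    ("set", 0),                        # reset before func1 and kern1
    ("inst", 3),                       # func1 uses up to v3
    ("inst", 1), ("kernel", "kern1"),  # kern1 itself uses up to v1
]
without_reset = [event for event in with_reset if event[0] != "set"]

print(assemble(with_reset))     # {'kern0': 10, 'kern1': 4}
print(assemble(without_reset))  # {'kern0': 10, 'kern1': 10}
```

With the reset, ``kern1`` is charged only for the registers used by itself and
``func1``; without it, ``kern1`` inherits the high-water mark left by
``kern0``.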
12367
12368 Additional Documentation
12369 ========================
12370
12371 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
12372 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
12373 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
12374 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
12375 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
12376 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
12377 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
12378 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
12379 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
12380 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
12381 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
12382 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
12383 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
12384 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
12385 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
12386 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
12387 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
12388 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
12389 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
12390 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
12391 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
12392 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
12393 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
12394 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__