docs/AMDGPUUsage.rst

   1 =============================
   2 User Guide for AMDGPU Backend
   3 =============================
   4
   5 .. contents::
   6    :local:
   7
   8 Introduction
   9 ============
  10
  11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
  12 R600 family up until the current GCN families. It lives in the
  13 ``lib/Target/AMDGPU`` directory.
  14
  15 LLVM
  16 ====
  17
  18 .. _amdgpu-target-triples:
  19
  20 Target Triples
  21 --------------
  22
  23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
  24 specify the target triple:
  25
  26   .. table:: AMDGPU Target Triples
  27      :name: amdgpu-target-triples-table
  28
  29      ============ ======== ========= ===========
  30      Architecture Vendor   OS        Environment
  31      ============ ======== ========= ===========
  32      r600         amd      <empty>   <empty>
  33      amdgcn       amd      <empty>   <empty>
  34      amdgcn       amd      amdhsa    <empty>
  35      amdgcn       amd      amdhsa    opencl
  36      amdgcn       amd      amdhsa    amdgizcl
  37      amdgcn       amd      amdhsa    amdgiz
  38      amdgcn       amd      amdhsa    hcc
  39      ============ ======== ========= ===========
  40
  41 ``r600-amd--``
  42   Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
  43   the MESA runtime.
  44
  45 ``amdgcn-amd--``
  46   Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
  47   the MESA runtime.
  48
  49 ``amdgcn-amd-amdhsa-``
  50   Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
  51   compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
  52
  53 ``amdgcn-amd-amdhsa-opencl``
  54   Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
  55   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
  56   :ref:`amdgpu-opencl`.
  57
  58 ``amdgcn-amd-amdhsa-amdgizcl``
  59   Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
  60   is used (see :ref:`amdgpu-address-spaces`).
  61
  62 ``amdgcn-amd-amdhsa-amdgiz``
  63   Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
  64   used (see :ref:`amdgpu-address-spaces`).
  65
  66 ``amdgcn-amd-amdhsa-hcc``
  67   Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
  68   executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
  69   :ref:`amdgpu-hcc`.
  70
  71 .. _amdgpu-processors:
  72
  73 Processors
  74 ----------
  75
  76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
  77 names from both the *Processor* and *Alternative Processor* can be used.
  78
  79   .. table:: AMDGPU Processors
  80      :name: amdgpu-processors-table
  81
  82      ========== =========== ============ ===== ======= ==================
  83      Processor  Alternative Target       dGPU/ Runtime Example
  84                 Processor   Triple       APU   Support Products
  85                             Architecture
  86      ========== =========== ============ ===== ======= ==================
  87      **R600** [AMD-R6xx]_
  88      --------------------------------------------------------------------
  89      r600                   r600         dGPU
  90      r630                   r600         dGPU
  91      rs880                  r600         dGPU
  92      rv670                  r600         dGPU
  93      **R700** [AMD-R7xx]_
  94      --------------------------------------------------------------------
  95      rv710                  r600         dGPU
  96      rv730                  r600         dGPU
  97      rv770                  r600         dGPU
  98      **Evergreen** [AMD-Evergreen]_
  99      --------------------------------------------------------------------
 100      cedar                  r600         dGPU
 101      redwood                r600         dGPU
 102      sumo                   r600         dGPU
 103      juniper                r600         dGPU
 104      cypress                r600         dGPU
 105      **Northern Islands** [AMD-Cayman-Trinity]_
 106      --------------------------------------------------------------------
 107      barts                  r600         dGPU
 108      turks                  r600         dGPU
 109      caicos                 r600         dGPU
 110      cayman                 r600         dGPU
 111      **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
 112      --------------------------------------------------------------------
 113      gfx600     - tahiti    amdgcn       dGPU
 114      gfx601     - pitcairn  amdgcn       dGPU
 115                 - verde
 116                 - oland
 117                 - hainan
 118      **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
 119      --------------------------------------------------------------------
 120      gfx700     - bonaire   amdgcn       dGPU          - Radeon HD 7790
 121                                                        - Radeon HD 8770
 122                                                        - R7 260
 123                                                        - R7 260X
 124      \          - kaveri    amdgcn       APU           - A6-7000
 125                                                        - A6 Pro-7050B
 126                                                        - A8-7100
 127                                                        - A8 Pro-7150B
 128                                                        - A10-7300
 129                                                        - A10 Pro-7350B
 130                                                        - FX-7500
 131                                                        - A8-7200P
 132                                                        - A10-7400P
 133                                                        - FX-7600P
 134      gfx701     - hawaii    amdgcn       dGPU  ROCm    - FirePro W8100
 135                                                        - FirePro W9100
 136                                                        - FirePro S9150
 137                                                        - FirePro S9170
 138      gfx702                              dGPU  ROCm    - Radeon R9 290
 139                                                        - Radeon R9 290x
 140                                                        - Radeon R390
 141                                                        - Radeon R390x
 142      gfx703     - kabini    amdgcn       APU           - E1-2100
 143                 - mullins                              - E1-2200
 144                                                        - E1-2500
 145                                                        - E2-3000
 146                                                        - E2-3800
 147                                                        - A4-5000
 148                                                        - A4-5100
 149                                                        - A6-5200
 150                                                        - A4 Pro-3340B
 151      **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
 152      --------------------------------------------------------------------
 153      gfx800     - iceland   amdgcn       dGPU          - FirePro S7150
 154                                                        - FirePro S7100
 155                                                        - FirePro W7100
 156                                                        - Radeon R285
 157                                                        - Radeon R9 380
 158                                                        - Radeon R9 385
 159                                                        - Mobile FirePro
 160                                                          M7170
 161      gfx801     - carrizo   amdgcn       APU           - A6-8500P
 162                                                        - Pro A6-8500B
 163                                                        - A8-8600P
 164                                                        - Pro A8-8600B
 165                                                        - FX-8800P
 166                                                        - Pro A12-8800B
 167      \                      amdgcn       APU   ROCm    - A10-8700P
 168                                                        - Pro A10-8700B
 169                                                        - A10-8780P
 170      \                      amdgcn       APU           - A10-9600P
 171                                                        - A10-9630P
 172                                                        - A12-9700P
 173                                                        - A12-9730P
 174                                                        - FX-9800P
 175                                                        - FX-9830P
 176      \                      amdgcn       APU           - E2-9010
 177                                                        - A6-9210
 178                                                        - A9-9410
 179      gfx802     - tonga     amdgcn       dGPU  ROCm    Same as gfx800
 180      gfx803     - fiji      amdgcn       dGPU  ROCm    - Radeon R9 Nano
 181                                                        - Radeon R9 Fury
 182                                                        - Radeon R9 FuryX
 183                                                        - Radeon Pro Duo
 184                                                        - FirePro S9300x2
 185                                                        - Radeon Instinct MI8
 186      \          - polaris10 amdgcn       dGPU  ROCm    - Radeon RX 470
 187                                                        - Radeon RX 480
 188                                                        - Radeon Instinct MI6
 189      \          - polaris11 amdgcn       dGPU  ROCm    - Radeon RX 460
 190      gfx804                 amdgcn       dGPU          Same as gfx803
 191      gfx810     - stoney    amdgcn       APU
 192      **GCN GFX9** [AMD-Vega]_
 193      --------------------------------------------------------------------
 194      gfx900                 amdgcn       dGPU          - Radeon Vega
 195                                                          Frontier Edition
 196                                                        - Radeon RX Vega 56
 197                                                        - Radeon RX Vega 64
 198                                                        - Radeon RX Vega 64
 199                                                          Liquid
 200                                                        - Radeon Instinct MI25
 201      gfx901                 amdgcn       dGPU  ROCm    Same as gfx900
 202                                                        except XNACK is
 203                                                        enabled
 204      gfx902                 amdgcn       APU           *TBA*
 205
 206                                                        .. TODO
 207                                                           Add product
 208                                                           names.
 209      gfx903                 amdgcn       APU           Same as gfx902
 210                                                        except XNACK is
 211                                                        enabled
 212      ========== =========== ============ ===== ======= ==================
 213
 214 .. _amdgpu-address-spaces:
 215
 216 Address Spaces
 217 --------------
 218
 219 The AMDGPU backend uses the following address space mappings.
 220
 221 The memory space names used in the table, aside from the region memory space, is
 222 from the OpenCL standard.
 223
 224 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
 225
 226   .. table:: Address Space Mapping
 227      :name: amdgpu-address-space-mapping-table
 228
 229      ================== ================= ================= ================= =================
 230      LLVM Address Space Memory Space
 231      ------------------ -----------------------------------------------------------------------
 232      \                  Current Default   amdgiz/amdgizcl   hcc               Future Default
 233      ================== ================= ================= ================= =================
 234      0                  Private (Scratch) Generic (Flat)    Generic (Flat)    Generic (Flat)
 235      1                  Global            Global            Global            Global
 236      2                  Constant          Constant          Constant          Region (GDS)
 237      3                  Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
 238      4                  Generic (Flat)    Region (GDS)      Region (GDS)      Constant
 239      5                  Region (GDS)      Private (Scratch) Private (Scratch) Private (Scratch)
 240      ================== ================= ================= ================= =================
 241
 242 Current Default
 243   This is the current default address space mapping used for all languages
 244   except hcc. This will shortly be deprecated.
 245
 246 amdgiz/amdgizcl
 247   This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
 248   is specified as the target triple environment value.
 249
 250 hcc
 251   This is the current address space mapping used when ``hcc`` is specified as
 252   the target triple environment value.This will shortly be deprecated.
 253
 254 Future Default
 255   This will shortly be the only address space mapping for all languages using
 256   AMDGPU backend.
 257
 258 .. _amdgpu-memory-scopes:
 259
 260 Memory Scopes
 261 -------------
 262
 263 This section provides LLVM memory synchronization scopes supported by the AMDGPU
 264 backend memory model when the target triple OS is ``amdhsa`` (see
 265 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
 266
 267 The memory model supported is based on the HSA memory model [HSA]_ which is
 268 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
 269 relation is transitive over the synchonizes-with relation independent of scope,
 270 and synchonizes-with allows the memory scope instances to be inclusive (see
 271 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
 272
 273 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
 274 inclusion and requires the memory scopes to exactly match. However, this
 275 is conservatively correct for OpenCL.
 276
 277   .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
 278      :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
 279
 280      ================ ==========================================================
 281      LLVM Sync Scope  Description
 282      ================ ==========================================================
 283      *none*           The default: ``system``.
 284
 285                       Synchronizes with, and participates in modification and
 286                       seq_cst total orderings with, other operations (except
 287                       image operations) for all address spaces (except private,
 288                       or generic that accesses private) provided the other
 289                       operation's sync scope is:
 290
 291                       - ``system``.
 292                       - ``agent`` and executed by a thread on the same agent.
 293                       - ``workgroup`` and executed by a thread in the same
 294                         workgroup.
 295                       - ``wavefront`` and executed by a thread in the same
 296                         wavefront.
 297
 298      ``agent``        Synchronizes with, and participates in modification and
 299                       seq_cst total orderings with, other operations (except
 300                       image operations) for all address spaces (except private,
 301                       or generic that accesses private) provided the other
 302                       operation's sync scope is:
 303
 304                       - ``system`` or ``agent`` and executed by a thread on the
 305                         same agent.
 306                       - ``workgroup`` and executed by a thread in the same
 307                         workgroup.
 308                       - ``wavefront`` and executed by a thread in the same
 309                         wavefront.
 310
 311      ``workgroup``    Synchronizes with, and participates in modification and
 312                       seq_cst total orderings with, other operations (except
 313                       image operations) for all address spaces (except private,
 314                       or generic that accesses private) provided the other
 315                       operation's sync scope is:
 316
 317                       - ``system``, ``agent`` or ``workgroup`` and executed by a
 318                         thread in the same workgroup.
 319                       - ``wavefront`` and executed by a thread in the same
 320                         wavefront.
 321
 322      ``wavefront``    Synchronizes with, and participates in modification and
 323                       seq_cst total orderings with, other operations (except
 324                       image operations) for all address spaces (except private,
 325                       or generic that accesses private) provided the other
 326                       operation's sync scope is:
 327
 328                       - ``system``, ``agent``, ``workgroup`` or ``wavefront``
 329                         and executed by a thread in the same wavefront.
 330
 331      ``singlethread`` Only synchronizes with, and participates in modification
 332                       and seq_cst total orderings with, other operations (except
 333                       image operations) running in the same thread for all
 334                       address spaces (for example, in signal handlers).
 335      ================ ==========================================================
 336
 337 AMDGPU Intrinsics
 338 -----------------
 339
 340 The AMDGPU backend implements the following intrinsics.
 341
 342 *This section is WIP.*
 343
 344 .. TODO
 345    List AMDGPU intrinsics
 346
 347 Code Object
 348 ===========
 349
 350 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
 351 can be linked by ``lld`` to produce a standard ELF shared code object which can
 352 be loaded and executed on an AMDGPU target.
 353
 354 Header
 355 ------
 356
 357 The AMDGPU backend uses the following ELF header:
 358
 359   .. table:: AMDGPU ELF Header
 360      :name: amdgpu-elf-header-table
 361
 362      ========================== =========================
 363      Field                      Value
 364      ========================== =========================
 365      ``e_ident[EI_CLASS]``      ``ELFCLASS64``
 366      ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
 367      ``e_ident[EI_OSABI]``      ``ELFOSABI_AMDGPU_HSA``
 368      ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
 369      ``e_type``                 ``ET_REL`` or ``ET_DYN``
 370      ``e_machine``              ``EM_AMDGPU``
 371      ``e_entry``                0
 372      ``e_flags``                0
 373      ========================== =========================
 374
 375 ..
 376
 377   .. table:: AMDGPU ELF Header Enumeration Values
 378      :name: amdgpu-elf-header-enumeration-values-table
 379
 380      ============================ =====
 381      Name                         Value
 382      ============================ =====
 383      ``EM_AMDGPU``                224
 384      ``ELFOSABI_AMDGPU_HSA``      64
 385      ``ELFABIVERSION_AMDGPU_HSA`` 1
 386      ============================ =====
 387
 388 ``e_ident[EI_CLASS]``
 389   The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
 390   applications.
 391
 392 ``e_ident[EI_DATA]``
 393   All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
 394
 395 ``e_ident[EI_OSABI]``
 396   The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
 397   specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
 398
 399 ``e_ident[EI_ABIVERSION]``
 400   The AMD GPU architecture specific OS ABI version of
 401   ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
 402   ABI to which the code object conforms.
 403
 404 ``e_type``
 405   Can be one of the following values:
 406
 407
 408   ``ET_REL``
 409     The type produced by the AMD GPU backend compiler as it is relocatable code
 410     object.
 411
 412   ``ET_DYN``
 413     The type produced by the linker as it is a shared code object.
 414
 415   The AMD HSA runtime loader requires a ``ET_DYN`` code object.
 416
 417 ``e_machine``
 418   The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
 419   architecture family. The specific member is specified in the
 420   ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
 421   :ref:`amdgpu-note-records`).
 422
 423 ``e_entry``
 424   The entry point is 0 as the entry points for individual kernels must be
 425   selected in order to invoke them through AQL packets.
 426
 427 ``e_flags``
 428   The value is 0 as no flags are used.
 429
 430 Sections
 431 --------
 432
 433 An AMDGPU target ELF code object has the standard ELF sections which include:
 434
 435   .. table:: AMDGPU ELF Sections
 436      :name: amdgpu-elf-sections-table
 437
 438      ================== ================ =================================
 439      Name               Type             Attributes
 440      ================== ================ =================================
 441      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
 442      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 443      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
 444      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
 445      ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 446      ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 447      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
 448      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
 449      ``.note``          ``SHT_NOTE``     *none*
 450      ``.rela``\ *name*  ``SHT_RELA``     *none*
 451      ``.rela.dyn``      ``SHT_RELA``     *none*
 452      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
 453      ``.shstrtab``      ``SHT_STRTAB``   *none*
 454      ``.strtab``        ``SHT_STRTAB``   *none*
 455      ``.symtab``        ``SHT_SYMTAB``   *none*
 456      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
 457      ================== ================ =================================
 458
 459 These sections have their standard meanings (see [ELF]_) and are only generated
 460 if needed.
 461
 462 ``.debug``\ *\**
 463   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
 464   DWARF produced by the AMDGPU backend.
 465
 466 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
 467   The standard sections used by a dynamic loader.
 468
 469 ``.note``
 470   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
 471   backend.
 472
 473 ``.rela``\ *name*, ``.rela.dyn``
 474   For relocatable code objects, *name* is the name of the section that the
 475   relocation records apply. For example, ``.rela.text`` is the section name for
 476   relocation records associated with the ``.text`` section.
 477
 478   For linked shared code objects, ``.rela.dyn`` contains all the relocation
 479   records from each of the relocatable code object's ``.rela``\ *name* sections.
 480
 481   See :ref:`amdgpu-relocation-records` for the relocation records supported by
 482   the AMDGPU backend.
 483
 484 ``.text``
 485   The executable machine code for the kernels and functions they call. Generated
 486   as position independent code. See :ref:`amdgpu-code-conventions` for
 487   information on conventions used in the isa generation.
 488
 489 .. _amdgpu-note-records:
 490
 491 Note Records
 492 ------------
 493
 494 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
 495 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
 496 minimal zero byte padding must be generated to ensure the ``desc`` field size is
 497 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
 498 be at least 4 to indicate at least 8 byte alignment.
 499
 500 The AMDGPU backend code object uses the following ELF note records in the
 501 ``.note`` section. The *Description* column specifies the layout of the note
 502 record’s ``desc`` field. All fields are consecutive bytes. Note records with
 503 variable size strings have a corresponding ``*_size`` field that specifies the
 504 number of bytes, including the terminating null character, in the string. The
 505 string(s) come immediately after the preceding fields.
 506
 507 Additional note records can be present.
 508
 509   .. table:: AMDGPU ELF Note Records
 510      :name: amdgpu-elf-note-records-table
 511
 512      ===== ============================== ======================================
 513      Name  Type                           Description
 514      ===== ============================== ======================================
 515      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
 516      "AMD" ``NT_AMD_AMDGPU_ISA``          <isa name null terminated string>
 517      ===== ============================== ======================================
 518
 519 ..
 520
 521   .. table:: AMDGPU ELF Note Record Enumeration Values
 522      :name: amdgpu-elf-note-record-enumeration-values-table
 523
 524      ============================== =====
 525      Name                           Value
 526      ============================== =====
 527      *reserved*                       0-9
 528      ``NT_AMD_AMDGPU_HSA_METADATA``    10
 529      ``NT_AMD_AMDGPU_ISA``             11
 530      ============================== =====
 531
 532 ``NT_AMD_AMDGPU_ISA``
 533   Specifies the instruction set architecture used by the machine code contained
 534   in the code object.
 535
 536   This note record is required for code objects containing machine code for
 537   processors matching the ``amdgcn`` architecture in table
 538   :ref:`amdgpu-processors`.
 539
 540   The null terminated string has the following syntax:
 541
 542     *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
 543
 544   where:
 545
 546     *architecture*
 547       The architecture from table :ref:`amdgpu-target-triples-table`.
 548
 549       This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
 550       :ref:`amdgpu-target-triples`).
 551
 552     *vendor*
 553       The vendor from table :ref:`amdgpu-target-triples-table`.
 554
 555       For the AMDGPU backend this is always ``amd``.
 556
 557     *os*
 558       The OS from table :ref:`amdgpu-target-triples-table`.
 559
 560     *environment*
 561       An environment from table :ref:`amdgpu-target-triples-table`, or blank if
 562       the environment has no affect on the execution of the code object.
 563
 564       For the AMDGPU backend this is currently always blank.
 565     *processor*
 566       The processor from table :ref:`amdgpu-processors-table`.
 567
 568   For example:
 569
 570     ``amdgcn-amd-amdhsa--gfx901``
 571
 572 ``NT_AMD_AMDGPU_HSA_METADATA``
 573   Specifies extensible metadata associated with the code objects executed on HSA
 574   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
 575   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
 576   :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
 577   object metadata string.
 578
 579 .. _amdgpu-symbols:
 580
 581 Symbols
 582 -------
 583
 584 Symbols include the following:
 585
 586   .. table:: AMDGPU ELF Symbols
 587      :name: amdgpu-elf-symbols-table
 588
 589      ===================== ============== ============= ==================
 590      Name                  Type           Section       Description
 591      ===================== ============== ============= ==================
 592      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
 593                                           - ``.rodata``
 594                                           - ``.bss``
 595      *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
 596      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
 597      ===================== ============== ============= ==================
 598
 599 Global variable
 600   Global variables both used and defined by the compilation unit.
 601
 602   If the symbol is defined in the compilation unit then it is allocated in the
 603   appropriate section according to if it has initialized data or is readonly.
 604
 605   If the symbol is external then its section is ``STN_UNDEF`` and the loader
 606   will resolve relocations using the definition provided by another code object
 607   or explicitly defined by the runtime.
 608
 609   All global symbols, whether defined in the compilation unit or external, are
 610   accessed by the machine code indirectly through a GOT table entry. This
 611   allows them to be preemptable. The GOT table is only supported when the target
 612   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 613
 614   .. TODO
 615      Add description of linked shared object symbols. Seems undefined symbols
 616      are marked as STT_NOTYPE.
 617
 618 Kernel descriptor
 619   Every HSA kernel has an associated kernel descriptor. It is the address of the
 620   kernel descriptor that is used in the AQL dispatch packet used to invoke the
 621   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
 622   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
 623
 624 Kernel entry point
 625   Every HSA kernel also has a symbol for its machine code entry point.
 626
 627 .. _amdgpu-relocation-records:
 628
 629 Relocation Records
 630 ------------------
 631
 632 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
 633 relocatable fields are:
 634
 635 ``word32``
 636   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
 637   alignment. These values use the same byte order as other word values in the
 638   AMD GPU architecture.
 639
 640 ``word64``
 641   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
 642   alignment. These values use the same byte order as other word values in the
 643   AMD GPU architecture.
 644
 645 Following notations are used for specifying relocation calculations:
 646
 647 **A**
 648   Represents the addend used to compute the value of the relocatable field.
 649
 650 **G**
 651   Represents the offset into the global offset table at which the relocation
 652   entry’s symbol will reside during execution.
 653
 654 **GOT**
 655   Represents the address of the global offset table.
 656
 657 **P**
 658   Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
 659   of the storage unit being relocated (computed using ``r_offset``).
 660
 661 **S**
 662   Represents the value of the symbol whose index resides in the relocation
 663   entry.
 664
 665 The following relocation types are supported:
 666
 667   .. table:: AMDGPU ELF Relocation Records
 668      :name: amdgpu-elf-relocation-records-table
 669
 670      ==========================  =====  ==========  ==============================
 671      Relocation Type             Value  Field       Calculation
 672      ==========================  =====  ==========  ==============================
 673      ``R_AMDGPU_NONE``           0      *none*      *none*
 674      ``R_AMDGPU_ABS32_LO``       1      ``word32``  (S + A) & 0xFFFFFFFF
 675      ``R_AMDGPU_ABS32_HI``       2      ``word32``  (S + A) >> 32
 676      ``R_AMDGPU_ABS64``          3      ``word64``  S + A
 677      ``R_AMDGPU_REL32``          4      ``word32``  S + A - P
 678      ``R_AMDGPU_REL64``          5      ``word64``  S + A - P
 679      ``R_AMDGPU_ABS32``          6      ``word32``  S + A
 680      ``R_AMDGPU_GOTPCREL``       7      ``word32``  G + GOT + A - P
 681      ``R_AMDGPU_GOTPCREL32_LO``  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
 682      ``R_AMDGPU_GOTPCREL32_HI``  9      ``word32``  (G + GOT + A - P) >> 32
 683      ``R_AMDGPU_REL32_LO``       10     ``word32``  (S + A - P) & 0xFFFFFFFF
 684      ``R_AMDGPU_REL32_HI``       11     ``word32``  (S + A - P) >> 32
 685      ==========================  =====  ==========  ==============================
 686
 687 .. _amdgpu-dwarf:
 688
 689 DWARF
 690 -----
 691
 692 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
 693 information that maps the code object executable code and data to the source
 694 language constructs. It can be used by tools such as debuggers and profilers.
 695
 696 Address Space Mapping
 697 ~~~~~~~~~~~~~~~~~~~~~
 698
 699 The following address space mapping is used:
 700
 701   .. table:: AMDGPU DWARF Address Space Mapping
 702      :name: amdgpu-dwarf-address-space-mapping-table
 703
 704      =================== =================
 705      DWARF Address Space Memory Space
 706      =================== =================
 707      1                   Private (Scratch)
 708      2                   Local (group/LDS)
 709      *omitted*           Global
 710      *omitted*           Constant
 711      *omitted*           Generic (Flat)
 712      *not supported*     Region (GDS)
 713      =================== =================
 714
 715 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
 716 used in the table.
 717
 718 An ``address_class`` attribute is generated on pointer type DIEs to specify the
 719 DWARF address space of the value of the pointer when it is in the *private* or
 720 *local* address space. Otherwise the attribute is omitted.
 721
 722 An ``XDEREF`` operation is generated in location list expressions for variables
 723 that are allocated in the *private* and *local* address space. Otherwise no
 724 ``XDREF`` is omitted.
 725
 726 Register Mapping
 727 ~~~~~~~~~~~~~~~~
 728
 729 *This section is WIP.*
 730
 731 .. TODO
 732    Define DWARF register enumeration.
 733
 734    If want to present a wavefront state then should expose vector registers as
 735    64 wide (rather than per work-item view that LLVM uses). Either as separate
 736    registers, or a 64x4 byte single register. In either case use a new LANE op
 737    (akin to XDREF) to select the current lane usage in a location
 738    expression. This would also allow scalar register spilling to vector register
 739    lanes to be expressed (currently no debug information is being generated for
 740    spilling). If choose a wide single register approach then use LANE in
 741    conjunction with PIECE operation to select the dword part of the register for
 742    the current lane. If the separate register approach then use LANE to select
 743    the register.
 744
 745 Source Text
 746 ~~~~~~~~~~~
 747
 748 *This section is WIP.*
 749
 750 .. TODO
 751    DWARF extension to include runtime generated source text.
 752
 753 .. _amdgpu-code-conventions:
 754
 755 Code Conventions
 756 ================
 757
 758 This section provides code conventions used for each supported target triple OS
 759 (see :ref:`amdgpu-target-triples`).
 760
 761 AMDHSA
 762 ------
 763
 764 This section provides code conventions used when the target triple OS is
 765 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
 766
 767 .. _amdgpu-amdhsa-hsa-code-object-metadata:
 768
 769 Code Object Metadata
 770 ~~~~~~~~~~~~~~~~~~~~
 771
 772 The code object metadata specifies extensible metadata associated with the code
 773 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
 774 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
 775 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
 776 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
 777 information necessary to support the ROCM kernel queries. For example, the
 778 segment sizes needed in a dispatch packet. In addition, a high level language
 779 runtime may require other information to be included. For example, the AMD
 780 OpenCL runtime records kernel argument information.
 781
 782 The metadata is specified as a YAML formatted string (see [YAML]_ and
 783 :doc:`YamlIO`).
 784
 785 .. TODO
 786    Is the string null terminated? It probably should not if YAML allows it to
 787    contain null characters, otherwise it should be.
 788
 789 The metadata is represented as a single YAML document comprised of the mapping
 790 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
 791 referenced tables.
 792
 793 For boolean values, the string values of ``false`` and ``true`` are used for
 794 false and true respectively.
 795
 796 Additional information can be added to the mappings. To avoid conflicts, any
 797 non-AMD key names should be prefixed by "*vendor-name*.".
 798
 799   .. table:: AMDHSA Code Object Metadata Mapping
 800      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
 801
 802      ========== ============== ========= =======================================
 803      String Key Value Type     Required? Description
 804      ========== ============== ========= =======================================
 805      "Version"  sequence of    Required  - The first integer is the major
 806                 2 integers                 version. Currently 1.
 807                                          - The second integer is the minor
 808                                            version. Currently 0.
 809      "Printf"   sequence of              Each string is encoded information
 810                 strings                  about a printf function call. The
 811                                          encoded information is organized as
 812                                          fields separated by colon (':'):
 813
 814                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
 815
 816                                          where:
 817
 818                                          ``ID``
 819                                            A 32 bit integer as a unique id for
 820                                            each printf function call
 821
 822                                          ``N``
 823                                            A 32 bit integer equal to the number
 824                                            of arguments of printf function call
 825                                            minus 1
 826
 827                                          ``S[i]`` (where i = 0, 1, ... , N-1)
 828                                            32 bit integers for the size in bytes
 829                                            of the i-th FormatString argument of
 830                                            the printf function call
 831
 832                                          FormatString
 833                                            The format string passed to the
 834                                            printf function call.
 835      "Kernels"  sequence of    Required  Sequence of the mappings for each
 836                 mapping                  kernel in the code object. See
 837                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
 838                                          for the definition of the mapping.
 839      ========== ============== ========= =======================================
 840
 841 ..
 842
 843   .. table:: AMDHSA Code Object Kernel Metadata Mapping
 844      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
 845
 846      ================= ============== ========= ================================
 847      String Key        Value Type     Required? Description
 848      ================= ============== ========= ================================
 849      "Name"            string         Required  Source name of the kernel.
 850      "SymbolName"      string         Required  Name of the kernel
 851                                                 descriptor ELF symbol.
 852      "Language"        string                   Source language of the kernel.
 853                                                 Values include:
 854
 855                                                 - "OpenCL C"
 856                                                 - "OpenCL C++"
 857                                                 - "HCC"
 858                                                 - "OpenMP"
 859
 860      "LanguageVersion" sequence of              - The first integer is the major
 861                        2 integers                 version.
 862                                                 - The second integer is the
 863                                                   minor version.
 864      "Attrs"           mapping                  Mapping of kernel attributes.
 865                                                 See
 866                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
 867                                                 for the mapping definition.
 868      "Arguments"       sequence of              Sequence of mappings of the
 869                        mapping                  kernel arguments. See
 870                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
 871                                                 for the definition of the mapping.
 872      "CodeProps"       mapping                  Mapping of properties related to
 873                                                 the kernel code. See
 874                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
 875                                                 for the mapping definition.
 876      "DebugProps"      mapping                  Mapping of properties related to
 877                                                 the kernel debugging. See
 878                                                 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
 879                                                 for the mapping definition.
 880      ================= ============== ========= ================================
 881
 882 ..
 883
 884   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
 885      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
 886
 887      =================== ============== ========= ==============================
 888      String Key          Value Type     Required? Description
 889      =================== ============== ========= ==============================
 890      "ReqdWorkGroupSize" sequence of              The dispatch work-group size
 891                          3 integers               X, Y, Z must correspond to the
 892                                                   specified values.
 893
 894                                                   Corresponds to the OpenCL
 895                                                   ``reqd_work_group_size``
 896                                                   attribute.
 897      "WorkGroupSizeHint" sequence of              The dispatch work-group size
 898                          3 integers               X, Y, Z is likely to be the
 899                                                   specified values.
 900
 901                                                   Corresponds to the OpenCL
 902                                                   ``work_group_size_hint``
 903                                                   attribute.
 904      "VecTypeHint"       string                   The name of a scalar or vector
 905                                                   type.
 906
 907                                                   Corresponds to the OpenCL
 908                                                   ``vec_type_hint`` attribute.
 909      =================== ============== ========= ==============================
 910
 911 ..
 912
 913   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
 914      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
 915
 916      ================= ============== ========= ================================
 917      String Key        Value Type     Required? Description
 918      ================= ============== ========= ================================
 919      "Name"            string                   Kernel argument name.
 920      "TypeName"        string                   Kernel argument type name.
 921      "Size"            integer        Required  Kernel argument size in bytes.
 922      "Align"           integer        Required  Kernel argument alignment in
 923                                                 bytes. Must be a power of two.
 924      "ValueKind"       string         Required  Kernel argument kind that
 925                                                 specifies how to set up the
 926                                                 corresponding argument.
 927                                                 Values include:
 928
 929                                                 "ByValue"
 930                                                   The argument is copied
 931                                                   directly into the kernarg.
 932
 933                                                 "GlobalBuffer"
 934                                                   A global address space pointer
 935                                                   to the buffer data is passed
 936                                                   in the kernarg.
 937
 938                                                 "DynamicSharedPointer"
 939                                                   A group address space pointer
 940                                                   to dynamically allocated LDS
 941                                                   is passed in the kernarg.
 942
 943                                                 "Sampler"
 944                                                   A global address space
 945                                                   pointer to a S# is passed in
 946                                                   the kernarg.
 947
 948                                                 "Image"
 949                                                   A global address space
 950                                                   pointer to a T# is passed in
 951                                                   the kernarg.
 952
 953                                                 "Pipe"
 954                                                   A global address space pointer
 955                                                   to an OpenCL pipe is passed in
 956                                                   the kernarg.
 957
 958                                                 "Queue"
 959                                                   A global address space pointer
 960                                                   to an OpenCL device enqueue
 961                                                   queue is passed in the
 962                                                   kernarg.
 963
 964                                                 "HiddenGlobalOffsetX"
 965                                                   The OpenCL grid dispatch
 966                                                   global offset for the X
 967                                                   dimension is passed in the
 968                                                   kernarg.
 969
 970                                                 "HiddenGlobalOffsetY"
 971                                                   The OpenCL grid dispatch
 972                                                   global offset for the Y
 973                                                   dimension is passed in the
 974                                                   kernarg.
 975
 976                                                 "HiddenGlobalOffsetZ"
 977                                                   The OpenCL grid dispatch
 978                                                   global offset for the Z
 979                                                   dimension is passed in the
 980                                                   kernarg.
 981
 982                                                 "HiddenNone"
 983                                                   An argument that is not used
 984                                                   by the kernel. Space needs to
 985                                                   be left for it, but it does
 986                                                   not need to be set up.
 987
 988                                                 "HiddenPrintfBuffer"
 989                                                   A global address space pointer
 990                                                   to the runtime printf buffer
 991                                                   is passed in kernarg.
 992
 993                                                 "HiddenDefaultQueue"
 994                                                   A global address space pointer
 995                                                   to the OpenCL device enqueue
 996                                                   queue that should be used by
 997                                                   the kernel by default is
 998                                                   passed in the kernarg.
 999
1000                                                 "HiddenCompletionAction"
1001                                                   *TBD*
1002
1003                                                   .. TODO
1004                                                      Add description.
1005
1006      "ValueType"       string         Required  Kernel argument value type. Only
1007                                                 present if "ValueKind" is
1008                                                 "ByValue". For vector data
1009                                                 types, the value is for the
1010                                                 element type. Values include:
1011
1012                                                 - "Struct"
1013                                                 - "I8"
1014                                                 - "U8"
1015                                                 - "I16"
1016                                                 - "U16"
1017                                                 - "F16"
1018                                                 - "I32"
1019                                                 - "U32"
1020                                                 - "F32"
1021                                                 - "I64"
1022                                                 - "U64"
1023                                                 - "F64"
1024
1025                                                 .. TODO
1026                                                    How can it be determined if a
1027                                                    vector type, and what size
1028                                                    vector?
1029      "PointeeAlign"    integer                  Alignment in bytes of pointee
1030                                                 type for pointer type kernel
1031                                                 argument. Must be a power
1032                                                 of 2. Only present if
1033                                                 "ValueKind" is
1034                                                 "DynamicSharedPointer".
1035      "AddrSpaceQual"   string                   Kernel argument address space
1036                                                 qualifier. Only present if
1037                                                 "ValueKind" is "GlobalBuffer" or
1038                                                 "DynamicSharedPointer". Values
1039                                                 are:
1040
1041                                                 - "Private"
1042                                                 - "Global"
1043                                                 - "Constant"
1044                                                 - "Local"
1045                                                 - "Generic"
1046                                                 - "Region"
1047
1048                                                 .. TODO
1049                                                    Is GlobalBuffer only Global
1050                                                    or Constant? Is
1051                                                    DynamicSharedPointer always
1052                                                    Local? Can HCC allow Generic?
1053                                                    How can Private or Region
1054                                                    ever happen?
1055      "AccQual"         string                   Kernel argument access
1056                                                 qualifier. Only present if
1057                                                 "ValueKind" is "Image" or
1058                                                 "Pipe". Values
1059                                                 are:
1060
1061                                                 - "ReadOnly"
1062                                                 - "WriteOnly"
1063                                                 - "ReadWrite"
1064
1065                                                 .. TODO
1066                                                    Does this apply to
1067                                                    GlobalBuffer?
1068      "ActualAcc"       string                   The actual memory accesses
1069                                                 performed by the kernel on the
1070                                                 kernel argument. Only present if
1071                                                 "ValueKind" is "GlobalBuffer",
1072                                                 "Image", or "Pipe". This may be
1073                                                 more restrictive than indicated
1074                                                 by "AccQual" to reflect what the
1075                                                 kernel actual does. If not
1076                                                 present then the runtime must
1077                                                 assume what is implied by
1078                                                 "AccQual" and "IsConst". Values
1079                                                 are:
1080
1081                                                 - "ReadOnly"
1082                                                 - "WriteOnly"
1083                                                 - "ReadWrite"
1084
1085      "IsConst"         boolean                  Indicates if the kernel argument
1086                                                 is const qualified. Only present
1087                                                 if "ValueKind" is
1088                                                 "GlobalBuffer".
1089
1090      "IsRestrict"      boolean                  Indicates if the kernel argument
1091                                                 is restrict qualified. Only
1092                                                 present if "ValueKind" is
1093                                                 "GlobalBuffer".
1094
1095      "IsVolatile"      boolean                  Indicates if the kernel argument
1096                                                 is volatile qualified. Only
1097                                                 present if "ValueKind" is
1098                                                 "GlobalBuffer".
1099
1100      "IsPipe"          boolean                  Indicates if the kernel argument
1101                                                 is pipe qualified. Only present
1102                                                 if "ValueKind" is "Pipe".
1103
1104                                                 .. TODO
1105                                                    Can GlobalBuffer be pipe
1106                                                    qualified?
1107      ================= ============== ========= ================================
1108
1109 ..
1110
1111   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1112      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1113
1114      ============================ ============== ========= =====================
1115      String Key                   Value Type     Required? Description
1116      ============================ ============== ========= =====================
1117      "KernargSegmentSize"         integer        Required  The size in bytes of
1118                                                            the kernarg segment
1119                                                            that holds the values
1120                                                            of the arguments to
1121                                                            the kernel.
1122      "GroupSegmentFixedSize"      integer        Required  The amount of group
1123                                                            segment memory
1124                                                            required by a
1125                                                            work-group in
1126                                                            bytes. This does not
1127                                                            include any
1128                                                            dynamically allocated
1129                                                            group segment memory
1130                                                            that may be added
1131                                                            when the kernel is
1132                                                            dispatched.
1133      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1134                                                            private address space
1135                                                            memory required for a
1136                                                            work-item in
1137                                                            bytes. If
1138                                                            IsDynamicCallstack
1139                                                            is 1 then additional
1140                                                            space must be added
1141                                                            to this value for the
1142                                                            call stack.
1143      "KernargSegmentAlign"        integer        Required  The maximum byte
1144                                                            alignment of
1145                                                            arguments in the
1146                                                            kernarg segment. Must
1147                                                            be a power of 2.
1148      "WavefrontSize"              integer        Required  Wavefront size. Must
1149                                                            be a power of 2.
1150      "NumSGPRs"                   integer                  Number of scalar
1151                                                            registers used by a
1152                                                            wavefront for
1153                                                            GFX6-GFX9. This
1154                                                            includes the special
1155                                                            SGPRs for VCC, Flat
1156                                                            Scratch (GFX7-GFX9)
1157                                                            and XNACK (for
1158                                                            GFX8-GFX9). It does
1159                                                            not include the 16
1160                                                            SGPR added if a trap
1161                                                            handler is
1162                                                            enabled. It is not
1163                                                            rounded up to the
1164                                                            allocation
1165                                                            granularity.
1166      "NumVGPRs"                   integer                  Number of vector
1167                                                            registers used by
1168                                                            each work-item for
1169                                                            GFX6-GFX9
1170      "MaxFlatWorkgroupSize"       integer                  Maximum flat
1171                                                            work-group size
1172                                                            supported by the
1173                                                            kernel in work-items.
1174      "IsDynamicCallStack"         boolean                  Indicates if the
1175                                                            generated machine
1176                                                            code is using a
1177                                                            dynamically sized
1178                                                            call stack.
1179      "IsXNACKEnabled"             boolean                  Indicates if the
1180                                                            generated machine
1181                                                            code is capable of
1182                                                            supporting XNACK.
1183      ============================ ============== ========= =====================
1184
1185 ..
1186
1187   .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1188      :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1189
1190      =================================== ============== ========= ==============
1191      String Key                          Value Type     Required? Description
1192      =================================== ============== ========= ==============
1193      "DebuggerABIVersion"                string
1194      "ReservedNumVGPRs"                  integer
1195      "ReservedFirstVGPR"                 integer
1196      "PrivateSegmentBufferSGPR"          integer
1197      "WavefrontPrivateSegmentOffsetSGPR" integer
1198      =================================== ============== ========= ==============
1199
1200 .. TODO
1201    Plan to remove the debug properties metadata.
1202
1203 Kernel Dispatch
1204 ~~~~~~~~~~~~~~~
1205
1206 The HSA architected queuing language (AQL) defines a user space memory interface
1207 that can be used to control the dispatch of kernels, in an agent independent
1208 way. An agent can have zero or more AQL queues created for it using the ROCm
1209 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1210 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1211 mechanics and packet layouts.
1212
1213 The packet processor of a kernel agent is responsible for detecting and
1214 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1215 packet processor is implemented by the hardware command processor (CP),
1216 asynchronous dispatch controller (ADC) and shader processor input controller
1217 (SPI).
1218
1219 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1220 mode driver to initialize and register the AQL queue with CP.
1221
1222 To dispatch a kernel the following actions are performed. This can occur in the
1223 CPU host program, or from an HSA kernel executing on a GPU.
1224
1225 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1226    executed is obtained.
1227 2. A pointer to the kernel descriptor (see
1228    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1229    obtained. It must be for a kernel that is contained in a code object that that
1230    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1231    associated.
1232 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1233    for a memory region with the kernarg property for the kernel agent that will
1234    execute the kernel. It must be at least 16 byte aligned.
1235 4. Kernel argument values are assigned to the kernel argument memory
1236    allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1237    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1238    memory in the same way constant memory is accessed. (Note that the HSA
1239    specification allows an implementation to copy the kernel argument contents to
1240    another location that is accessed by the kernel.)
1241 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1242    api uses 64 bit atomic operations to reserve space in the AQL queue for the
1243    packet. The packet must be set up, and the final write must use an atomic
1244    store release to set the packet kind to ensure the packet contents are
1245    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1246    notify the kernel agent that the AQL queue has been updated. These rules, and
1247    the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1248    System Architecture Specification* [HSA]_.
1249 6. A kernel dispatch packet includes information about the actual dispatch,
1250    such as grid and work-group size, together with information from the code
1251    object about the kernel, such as segment sizes. The ROCm runtime queries on
1252    the kernel symbol can be used to obtain the code object values which are
1253    recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1254 7. CP executes micro-code and is responsible for detecting and setting up the
1255    GPU to execute the wavefronts of a kernel dispatch.
1256 8. CP ensures that when the a wavefront starts executing the kernel machine
1257    code, the scalar general purpose registers (SGPR) and vector general purpose
1258    registers (VGPR) are set up as required by the machine code. The required
1259    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1260    register state is defined in
1261    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1262 9. The prolog of the kernel machine code (see
1263    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1264    before continuing executing the machine code that corresponds to the kernel.
1265 10. When the kernel dispatch has completed execution, CP signals the completion
1266     signal specified in the kernel dispatch packet if not 0.
1267
1268 .. _amdgpu-amdhsa-memory-spaces:
1269
1270 Memory Spaces
1271 ~~~~~~~~~~~~~
1272
1273 The memory space properties are:
1274
1275   .. table:: AMDHSA Memory Spaces
1276      :name: amdgpu-amdhsa-memory-spaces-table
1277
1278      ================= =========== ======== ======= ==================
1279      Memory Space Name HSA Segment Hardware Address NULL Value
1280                        Name        Name     Size
1281      ================= =========== ======== ======= ==================
1282      Private           private     scratch  32      0x00000000
1283      Local             group       LDS      32      0xFFFFFFFF
1284      Global            global      global   64      0x0000000000000000
1285      Constant          constant    *same as 64      0x0000000000000000
1286                                    global*
1287      Generic           flat        flat     64      0x0000000000000000
1288      Region            N/A         GDS      32      *not implemented
1289                                                     for AMDHSA*
1290      ================= =========== ======== ======= ==================
1291
1292 The global and constant memory spaces both use global virtual addresses, which
1293 are the same virtual address space used by the CPU. However, some virtual
1294 addresses may only be accessible to the CPU, some only accessible by the GPU,
1295 and some by both.
1296
1297 Using the constant memory space indicates that the data will not change during
1298 the execution of the kernel. This allows scalar read instructions to be
1299 used. The vector and scalar L1 caches are invalidated of volatile data before
1300 each kernel dispatch execution to allow constant memory to change values between
1301 kernel dispatches.
1302
1303 The local memory space uses the hardware Local Data Store (LDS) which is
1304 automatically allocated when the hardware creates work-groups of wavefronts, and
1305 freed when all the wavefronts of a work-group have terminated. The data store
1306 (DS) instructions can be used to access it.
1307
1308 The private memory space uses the hardware scratch memory support. If the kernel
1309 uses scratch, then the hardware allocates memory that is accessed using
1310 wavefront lane dword (4 byte) interleaving. The mapping used from private
1311 address to physical address is:
1312
1313   ``wavefront-scratch-base +
1314   (private-address * wavefront-size * 4) +
1315   (wavefront-lane-id * 4)``
1316
1317 There are different ways that the wavefront scratch base address is determined
1318 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1319 memory can be accessed in an interleaved manner using buffer instruction with
1320 the scratch buffer descriptor and per wave scratch offset, by the scratch
1321 instructions, or by flat instructions. If each lane of a wavefront accesses the
1322 same private address, the interleaving results in adjacent dwords being accessed
1323 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1324 supported except by flat and scratch instructions in GFX9.
1325
1326 The generic address space uses the hardware flat address support available in
1327 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1328 local appertures), that are outside the range of addressible global memory, to
1329 map from a flat address to a private or local address.
1330
1331 FLAT instructions can take a flat address and access global, private (scratch)
1332 and group (LDS) memory depending in if the address is within one of the
1333 apperture ranges. Flat access to scratch requires hardware aperture setup and
1334 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1335 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1336 (see :ref:`amdgpu-amdhsa-m0`).
1337
1338 To convert between a segment address and a flat address the base address of the
1339 appertures address can be used. For GFX7-GFX8 these are available in the
1340 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1341 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1342 GFX9 the appature base addresses are directly available as inline constant
1343 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1344 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1345 which makes it easier to convert from flat to segment or segment to flat.
1346
1347 Image and Samplers
1348 ~~~~~~~~~~~~~~~~~~
1349
1350 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1351 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1352 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1353 enumeration values for the queries that are not trivially deducible from the S#
1354 representation.
1355
1356 HSA Signals
1357 ~~~~~~~~~~~
1358
1359 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1360 structure allocated in memory accessible from both the CPU and GPU. The
1361 structure is defined by the ROCm runtime and subject to change between releases
1362 (see [AMD-ROCm-github]_).
1363
1364 .. _amdgpu-amdhsa-hsa-aql-queue:
1365
1366 HSA AQL Queue
1367 ~~~~~~~~~~~~~
1368
1369 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1370 between releases (see [AMD-ROCm-github]_). For some processors it contains
1371 fields needed to implement certain language features such as the flat address
1372 aperture bases. It also contains fields used by CP such as managing the
1373 allocation of scratch memory.
1374
1375 .. _amdgpu-amdhsa-kernel-descriptor:
1376
1377 Kernel Descriptor
1378 ~~~~~~~~~~~~~~~~~
1379
1380 A kernel descriptor consists of the information needed by CP to initiate the
1381 execution of a kernel, including the entry point address of the machine code
1382 that implements the kernel.
1383
1384 Kernel Descriptor for GFX6-GFX9
1385 +++++++++++++++++++++++++++++++
1386
1387 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1388
1389   .. table:: Kernel Descriptor for GFX6-GFX9
1390      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1391
1392      ======= ======= =============================== ===========================
1393      Bits    Size    Field Name                      Description
1394      ======= ======= =============================== ===========================
1395      31:0    4 bytes group_segment_fixed_size        The amount of fixed local
1396                                                      address space memory
1397                                                      required for a work-group
1398                                                      in bytes. This does not
1399                                                      include any dynamically
1400                                                      allocated local address
1401                                                      space memory that may be
1402                                                      added when the kernel is
1403                                                      dispatched.
1404      63:32   4 bytes private_segment_fixed_size      The amount of fixed
1405                                                      private address space
1406                                                      memory required for a
1407                                                      work-item in bytes. If
1408                                                      is_dynamic_callstack is 1
1409                                                      then additional space must
1410                                                      be added to this value for
1411                                                      the call stack.
1412      95:64   4 bytes max_flat_workgroup_size         Maximum flat work-group
1413                                                      size supported by the
1414                                                      kernel in work-items.
1415      96      1 bit   is_dynamic_call_stack           Indicates if the generated
1416                                                      machine code is using a
1417                                                      dynamically sized call
1418                                                      stack.
1419      97      1 bit   is_xnack_enabled                Indicates if the generated
1420                                                      machine code is capable of
1421                                                      suppoting XNACK.
1422      127:98  30 bits                                 Reserved. Must be 0.
1423      191:128 8 bytes kernel_code_entry_byte_offset   Byte offset (possibly
1424                                                      negative) from base
1425                                                      address of kernel
1426                                                      descriptor to kernel's
1427                                                      entry point instruction
1428                                                      which must be 256 byte
1429                                                      aligned.
1430      383:192 24                                      Reserved. Must be 0.
1431              bytes
1432      415:384 4 bytes compute_pgm_rsrc1               Compute Shader (CS)
1433                                                      program settings used by
1434                                                      CP to set up
1435                                                      ``COMPUTE_PGM_RSRC1``
1436                                                      configuration
1437                                                      register. See
1438                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1439      447:416 4 bytes compute_pgm_rsrc2               Compute Shader (CS)
1440                                                      program settings used by
1441                                                      CP to set up
1442                                                      ``COMPUTE_PGM_RSRC2``
1443                                                      configuration
1444                                                      register. See
1445                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1446      448     1 bit   enable_sgpr_private_segment     Enable the setup of the
1447                      _buffer                         SGPR user data registers
1448                                                      (see
1449                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1450
1451                                                      The total number of SGPR
1452                                                      user data registers
1453                                                      requested must not exceed
1454                                                      16 and match value in
1455                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1456                                                      Any requests beyond 16
1457                                                      will be ignored.
1458      449     1 bit   enable_sgpr_dispatch_ptr        *see above*
1459      450     1 bit   enable_sgpr_queue_ptr           *see above*
1460      451     1 bit   enable_sgpr_kernarg_segment_ptr *see above*
1461      452     1 bit   enable_sgpr_dispatch_id         *see above*
1462      453     1 bit   enable_sgpr_flat_scratch_init   *see above*
1463      454     1 bit   enable_sgpr_private_segment     *see above*
1464                      _size
1465      455     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1466                      _count_X                        should always be 0.
1467      456     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1468                      _count_Y                        should always be 0.
1469      457     1 bit   enable_sgpr_grid_workgroup      Not implemented in CP and
1470                      _count_Z                        should always be 0.
1471      463:458 6 bits                                  Reserved. Must be 0.
1472      511:464 4                                       Reserved. Must be 0.
1473              bytes
1474      512     **Total size 64 bytes.**
1475      ======= ===================================================================
1476
1477 ..
1478
1479   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1480      :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1481
1482      ======= ======= =============================== ===========================================================================
1483      Bits    Size    Field Name                      Description
1484      ======= ======= =============================== ===========================================================================
1485      5:0     6 bits  granulated_workitem_vgpr_count  Number of vector registers
1486                                                      used by each work-item,
1487                                                      granularity is device
1488                                                      specific:
1489
1490                                                      GFX6-9
1491                                                        roundup((max-vgpg + 1)
1492                                                        / 4) - 1
1493
1494                                                      Used by CP to set up
1495                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1496      9:6     4 bits  granulated_wavefront_sgpr_count Number of scalar registers
1497                                                      used by a wavefront,
1498                                                      granularity is device
1499                                                      specific:
1500
1501                                                      GFX6-8
1502                                                        roundup((max-sgpg + 1)
1503                                                        / 8) - 1
1504                                                      GFX9
1505                                                        roundup((max-sgpg + 1)
1506                                                        / 16) - 1
1507
1508                                                      Includes the special SGPRs
1509                                                      for VCC, Flat Scratch (for
1510                                                      GFX7 onwards) and XNACK
1511                                                      (for GFX8 onwards). It does
1512                                                      not include the 16 SGPR
1513                                                      added if a trap handler is
1514                                                      enabled.
1515
1516                                                      Used by CP to set up
1517                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1518      11:10   2 bits  priority                        Must be 0.
1519
1520                                                      Start executing wavefront
1521                                                      at the specified priority.
1522
1523                                                      CP is responsible for
1524                                                      filling in
1525                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1526      13:12   2 bits  float_mode_round_32             Wavefront starts execution
1527                                                      with specified rounding
1528                                                      mode for single (32
1529                                                      bit) floating point
1530                                                      precision floating point
1531                                                      operations.
1532
1533                                                      Floating point rounding
1534                                                      mode values are defined in
1535                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1536
1537                                                      Used by CP to set up
1538                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1539      15:14   2 bits  float_mode_round_16_64          Wavefront starts execution
1540                                                      with specified rounding
1541                                                      denorm mode for half/double (16
1542                                                      and 64 bit) floating point
1543                                                      precision floating point
1544                                                      operations.
1545
1546                                                      Floating point rounding
1547                                                      mode values are defined in
1548                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1549
1550                                                      Used by CP to set up
1551                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1552      17:16   2 bits  float_mode_denorm_32            Wavefront starts execution
1553                                                      with specified denorm mode
1554                                                      for single (32
1555                                                      bit)  floating point
1556                                                      precision floating point
1557                                                      operations.
1558
1559                                                      Floating point denorm mode
1560                                                      values are defined in
1561                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1562
1563                                                      Used by CP to set up
1564                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1565      19:18   2 bits  float_mode_denorm_16_64         Wavefront starts execution
1566                                                      with specified denorm mode
1567                                                      for half/double (16
1568                                                      and 64 bit) floating point
1569                                                      precision floating point
1570                                                      operations.
1571
1572                                                      Floating point denorm mode
1573                                                      values are defined in
1574                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1575
1576                                                      Used by CP to set up
1577                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1578      20      1 bit   priv                            Must be 0.
1579
1580                                                      Start executing wavefront
1581                                                      in privilege trap handler
1582                                                      mode.
1583
1584                                                      CP is responsible for
1585                                                      filling in
1586                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1587      21      1 bit   enable_dx10_clamp               Wavefront starts execution
1588                                                      with DX10 clamp mode
1589                                                      enabled. Used by the vector
1590                                                      ALU to force DX-10 style
1591                                                      treatment of NaN's (when
1592                                                      set, clamp NaN to zero,
1593                                                      otherwise pass NaN
1594                                                      through).
1595
1596                                                      Used by CP to set up
1597                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1598      22      1 bit   debug_mode                      Must be 0.
1599
1600                                                      Start executing wavefront
1601                                                      in single step mode.
1602
1603                                                      CP is responsible for
1604                                                      filling in
1605                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1606      23      1 bit   enable_ieee_mode                Wavefront starts execution
1607                                                      with IEEE mode
1608                                                      enabled. Floating point
1609                                                      opcodes that support
1610                                                      exception flag gathering
1611                                                      will quiet and propagate
1612                                                      signaling-NaN inputs per
1613                                                      IEEE 754-2008. Min_dx10 and
1614                                                      max_dx10 become IEEE
1615                                                      754-2008 compliant due to
1616                                                      signaling-NaN propagation
1617                                                      and quieting.
1618
1619                                                      Used by CP to set up
1620                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1621      24      1 bit   bulky                           Must be 0.
1622
1623                                                      Only one work-group allowed
1624                                                      to execute on a compute
1625                                                      unit.
1626
1627                                                      CP is responsible for
1628                                                      filling in
1629                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1630      25      1 bit   cdbg_user                       Must be 0.
1631
1632                                                      Flag that can be used to
1633                                                      control debugging code.
1634
1635                                                      CP is responsible for
1636                                                      filling in
1637                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1638      31:26   6 bits                                  Reserved. Must be 0.
1639      32      **Total size 4 bytes**
1640      ======= ===================================================================================================================
1641
1642 ..
1643
1644   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1645      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1646
1647      ======= ======= =============================== ===========================================================================
1648      Bits    Size    Field Name                      Description
1649      ======= ======= =============================== ===========================================================================
1650      0       1 bit   enable_sgpr_private_segment     Enable the setup of the
1651                      _wave_offset                    SGPR wave scratch offset
1652                                                      system register (see
1653                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1654
1655                                                      Used by CP to set up
1656                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1657      5:1     5 bits  user_sgpr_count                 The total number of SGPR
1658                                                      user data registers
1659                                                      requested. This number must
1660                                                      match the number of user
1661                                                      data registers enabled.
1662
1663                                                      Used by CP to set up
1664                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1665      6       1 bit   enable_trap_handler             Set to 1 if code contains a
1666                                                      TRAP instruction which
1667                                                      requires a trap handler to
1668                                                      be enabled.
1669
1670                                                      CP sets
1671                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1672                                                      if the runtime has
1673                                                      installed a trap handler
1674                                                      regardless of the setting
1675                                                      of this field.
1676      7       1 bit   enable_sgpr_workgroup_id_x      Enable the setup of the
1677                                                      system SGPR register for
1678                                                      the work-group id in the X
1679                                                      dimension (see
1680                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1681
1682                                                      Used by CP to set up
1683                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1684      8       1 bit   enable_sgpr_workgroup_id_y      Enable the setup of the
1685                                                      system SGPR register for
1686                                                      the work-group id in the Y
1687                                                      dimension (see
1688                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1689
1690                                                      Used by CP to set up
1691                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1692      9       1 bit   enable_sgpr_workgroup_id_z      Enable the setup of the
1693                                                      system SGPR register for
1694                                                      the work-group id in the Z
1695                                                      dimension (see
1696                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1697
1698                                                      Used by CP to set up
1699                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1700      10      1 bit   enable_sgpr_workgroup_info      Enable the setup of the
1701                                                      system SGPR register for
1702                                                      work-group information (see
1703                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1704
1705                                                      Used by CP to set up
1706                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1707      12:11   2 bits  enable_vgpr_workitem_id         Enable the setup of the
1708                                                      VGPR system registers used
1709                                                      for the work-item ID.
1710                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1711                                                      defines the values.
1712
1713                                                      Used by CP to set up
1714                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1715      13      1 bit   enable_exception_address_watch  Must be 0.
1716
1717                                                      Wavefront starts execution
1718                                                      with address watch
1719                                                      exceptions enabled which
1720                                                      are generated when L1 has
1721                                                      witnessed a thread access
1722                                                      an *address of
1723                                                      interest*.
1724
1725                                                      CP is responsible for
1726                                                      filling in the address
1727                                                      watch bit in
1728                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1729                                                      according to what the
1730                                                      runtime requests.
1731      14      1 bit   enable_exception_memory         Must be 0.
1732
1733                                                      Wavefront starts execution
1734                                                      with memory violation
1735                                                      exceptions exceptions
1736                                                      enabled which are generated
1737                                                      when a memory violation has
1738                                                      occurred for this wave from
1739                                                      L1 or LDS
1740                                                      (write-to-read-only-memory,
1741                                                      mis-aligned atomic, LDS
1742                                                      address out of range,
1743                                                      illegal address, etc.).
1744
1745                                                      CP sets the memory
1746                                                      violation bit in
1747                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1748                                                      according to what the
1749                                                      runtime requests.
1750      23:15   9 bits  granulated_lds_size             Must be 0.
1751
1752                                                      CP uses the rounded value
1753                                                      from the dispatch packet,
1754                                                      not this value, as the
1755                                                      dispatch may contain
1756                                                      dynamically allocated group
1757                                                      segment memory. CP writes
1758                                                      directly to
1759                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1760
1761                                                      Amount of group segment
1762                                                      (LDS) to allocate for each
1763                                                      work-group. Granularity is
1764                                                      device specific:
1765
1766                                                      GFX6:
1767                                                        roundup(lds-size / (64 * 4))
1768                                                      GFX7-GFX9:
1769                                                        roundup(lds-size / (128 * 4))
1770
1771      24      1 bit   enable_exception_ieee_754_fp    Wavefront starts execution
1772                      _invalid_operation              with specified exceptions
1773                                                      enabled.
1774
1775                                                      Used by CP to set up
1776                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
1777                                                      (set from bits 0..6).
1778
1779                                                      IEEE 754 FP Invalid
1780                                                      Operation
1781      25      1 bit   enable_exception_fp_denormal    FP Denormal one or more
1782                      _source                         input operands is a
1783                                                      denormal number
1784      26      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Division by
1785                      _division_by_zero               Zero
1786      27      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP FP Overflow
1787                      _overflow
1788      28      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Underflow
1789                      _underflow
1790      29      1 bit   enable_exception_ieee_754_fp    IEEE 754 FP Inexact
1791                      _inexact
1792      30      1 bit   enable_exception_int_divide_by  Integer Division by Zero
1793                      _zero                           (rcp_iflag_f32 instruction
1794                                                      only)
1795      31      1 bit                                   Reserved. Must be 0.
1796      32      **Total size 4 bytes.**
1797      ======= ===================================================================================================================
1798
1799 ..
1800
1801   .. table:: Floating Point Rounding Mode Enumeration Values
1802      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1803
1804      ===================================== ===== ===============================
1805      Enumeration Name                      Value Description
1806      ===================================== ===== ===============================
1807      AMD_FLOAT_ROUND_MODE_NEAR_EVEN        0     Round Ties To Even
1808      AMD_FLOAT_ROUND_MODE_PLUS_INFINITY    1     Round Toward +infinity
1809      AMD_FLOAT_ROUND_MODE_MINUS_INFINITY   2     Round Toward -infinity
1810      AMD_FLOAT_ROUND_MODE_ZERO             3     Round Toward 0
1811      ===================================== ===== ===============================
1812
1813 ..
1814
1815   .. table:: Floating Point Denorm Mode Enumeration Values
1816      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1817
1818      ===================================== ===== ===============================
1819      Enumeration Name                      Value Description
1820      ===================================== ===== ===============================
1821      AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST   0     Flush Source and Destination
1822                                                  Denorms
1823      AMD_FLOAT_DENORM_MODE_FLUSH_DST       1     Flush Output Denorms
1824      AMD_FLOAT_DENORM_MODE_FLUSH_SRC       2     Flush Source Denorms
1825      AMD_FLOAT_DENORM_MODE_FLUSH_NONE      3     No Flush
1826      ===================================== ===== ===============================
1827
1828 ..
1829
1830   .. table:: System VGPR Work-Item ID Enumeration Values
1831      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1832
1833      ===================================== ===== ===============================
1834      Enumeration Name                      Value Description
1835      ===================================== ===== ===============================
1836      AMD_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension ID.
1837      AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
1838                                                  dimensions ID.
1839      AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
1840                                                  dimensions ID.
1841      AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
1842      ===================================== ===== ===============================
1843
1844 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1845
1846 Initial Kernel Execution State
1847 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1848
1849 This section defines the register state that will be set up by the packet
1850 processor prior to the start of execution of every wavefront. This is limited by
1851 the constraints of the hardware controllers of CP/ADC/SPI.
1852
1853 The order of the SGPR registers is defined, but the compiler can specify which
1854 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1855 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1856 for enabled registers are dense starting at SGPR0: the first enabled register is
1857 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1858 an SGPR number.
1859
1860 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1861 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1862 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1863 initialized. These are then immediately followed by the System SGPRs that are
1864 set up by ADC/SPI and can have different values for each wave of the grid
1865 dispatch.
1866
1867 SGPR register initial state is defined in
1868 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1869
1870   .. table:: SGPR Register Set Up Order
1871      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1872
1873      ========== ========================== ====== ==============================
1874      SGPR Order Name                       Number Description
1875                 (kernel descriptor enable  of
1876                 field)                     SGPRs
1877      ========== ========================== ====== ==============================
1878      First      Private Segment Buffer     4      V# that can be used, together
1879                 (enable_sgpr_private              with Scratch Wave Offset as an
1880                 _segment_buffer)                  offset, to access the private
1881                                                   memory space using a segment
1882                                                   address.
1883
1884                                                   CP uses the value provided by
1885                                                   the runtime.
1886      then       Dispatch Ptr               2      64 bit address of AQL dispatch
1887                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
1888                                                   actually executing.
1889      then       Queue Ptr                  2      64 bit address of amd_queue_t
1890                 (enable_sgpr_queue_ptr)           object for AQL queue on which
1891                                                   the dispatch packet was
1892                                                   queued.
1893      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
1894                 (enable_sgpr_kernarg              segment. This is directly
1895                 _segment_ptr)                     copied from the
1896                                                   kernarg_address in the kernel
1897                                                   dispatch packet.
1898
1899                                                   Having CP load it once avoids
1900                                                   loading it at the beginning of
1901                                                   every wavefront.
1902      then       Dispatch Id                2      64 bit Dispatch ID of the
1903                 (enable_sgpr_dispatch_id)         dispatch packet being
1904                                                   executed.
1905      then       Flat Scratch Init          2      This is 2 SGPRs:
1906                 (enable_sgpr_flat_scratch
1907                 _init)                            GFX6
1908                                                     Not supported.
1909                                                   GFX7-GFX8
1910                                                     The first SGPR is a 32 bit
1911                                                     byte offset from
1912                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1913                                                     to per SPI base of memory
1914                                                     for scratch for the queue
1915                                                     executing the kernel
1916                                                     dispatch. CP obtains this
1917                                                     from the runtime. (The
1918                                                     Scratch Segment Buffer base
1919                                                     address is
1920                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1921                                                     plus this offset.) The value
1922                                                     of Scratch Wave Offset must
1923                                                     be added to this offset by
1924                                                     the kernel machine code,
1925                                                     right shifted by 8, and
1926                                                     moved to the FLAT_SCRATCH_HI
1927                                                     SGPR register.
1928                                                     FLAT_SCRATCH_HI corresponds
1929                                                     to SGPRn-4 on GFX7, and
1930                                                     SGPRn-6 on GFX8 (where SGPRn
1931                                                     is the highest numbered SGPR
1932                                                     allocated to the wave).
1933                                                     FLAT_SCRATCH_HI is
1934                                                     multiplied by 256 (as it is
1935                                                     in units of 256 bytes) and
1936                                                     added to
1937                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1938                                                     to calculate the per wave
1939                                                     FLAT SCRATCH BASE in flat
1940                                                     memory instructions that
1941                                                     access the scratch
1942                                                     apperture.
1943
1944                                                     The second SGPR is 32 bit
1945                                                     byte size of a single
1946                                                     work-item’s scratch memory
1947                                                     usage. CP obtains this from
1948                                                     the runtime, and it is
1949                                                     always a multiple of DWORD.
1950                                                     CP checks that the value in
1951                                                     the kernel dispatch packet
1952                                                     Private Segment Byte Size is
1953                                                     not larger, and requests the
1954                                                     runtime to increase the
1955                                                     queue's scratch size if
1956                                                     necessary. The kernel code
1957                                                     must move it to
1958                                                     FLAT_SCRATCH_LO which is
1959                                                     SGPRn-3 on GFX7 and SGPRn-5
1960                                                     on GFX8. FLAT_SCRATCH_LO is
1961                                                     used as the FLAT SCRATCH
1962                                                     SIZE in flat memory
1963                                                     instructions. Having CP load
1964                                                     it once avoids loading it at
1965                                                     the beginning of every
1966                                                     wavefront. GFX9 This is the
1967                                                     64 bit base address of the
1968                                                     per SPI scratch backing
1969                                                     memory managed by SPI for
1970                                                     the queue executing the
1971                                                     kernel dispatch. CP obtains
1972                                                     this from the runtime (and
1973                                                     divides it if there are
1974                                                     multiple Shader Arrays each
1975                                                     with its own SPI). The value
1976                                                     of Scratch Wave Offset must
1977                                                     be added by the kernel
1978                                                     machine code and the result
1979                                                     moved to the FLAT_SCRATCH
1980                                                     SGPR which is SGPRn-6 and
1981                                                     SGPRn-5. It is used as the
1982                                                     FLAT SCRATCH BASE in flat
1983                                                     memory instructions. then
1984                                                     Private Segment Size 1 The
1985                                                     32 bit byte size of a
1986                                                     (enable_sgpr_private single
1987                                                     work-item's
1988                                                     scratch_segment_size) memory
1989                                                     allocation. This is the
1990                                                     value from the kernel
1991                                                     dispatch packet Private
1992                                                     Segment Byte Size rounded up
1993                                                     by CP to a multiple of
1994                                                     DWORD.
1995
1996                                                   Having CP load it once avoids
1997                                                   loading it at the beginning of
1998                                                   every wavefront.
1999
2000                                                   This is not used for
2001                                                   GFX7-GFX8 since it is the same
2002                                                   value as the second SGPR of
2003                                                   Flat Scratch Init. However, it
2004                                                   may be needed for GFX9 which
2005                                                   changes the meaning of the
2006                                                   Flat Scratch Init value.
2007      then       Grid Work-Group Count X    1      32 bit count of the number of
2008                 (enable_sgpr_grid                 work-groups in the X dimension
2009                 _workgroup_count_X)               for the grid being
2010                                                   executed. Computed from the
2011                                                   fields in the kernel dispatch
2012                                                   packet as ((grid_size.x +
2013                                                   workgroup_size.x - 1) /
2014                                                   workgroup_size.x).
2015      then       Grid Work-Group Count Y    1      32 bit count of the number of
2016                 (enable_sgpr_grid                 work-groups in the Y dimension
2017                 _workgroup_count_Y &&             for the grid being
2018                 less than 16 previous             executed. Computed from the
2019                 SGPRs)                            fields in the kernel dispatch
2020                                                   packet as ((grid_size.y +
2021                                                   workgroup_size.y - 1) /
2022                                                   workgroupSize.y).
2023
2024                                                   Only initialized if <16
2025                                                   previous SGPRs initialized.
2026      then       Grid Work-Group Count Z    1      32 bit count of the number of
2027                 (enable_sgpr_grid                 work-groups in the Z dimension
2028                 _workgroup_count_Z &&             for the grid being
2029                 less than 16 previous             executed. Computed from the
2030                 SGPRs)                            fields in the kernel dispatch
2031                                                   packet as ((grid_size.z +
2032                                                   workgroup_size.z - 1) /
2033                                                   workgroupSize.z).
2034
2035                                                   Only initialized if <16
2036                                                   previous SGPRs initialized.
2037      then       Work-Group Id X            1      32 bit work-group id in X
2038                 (enable_sgpr_workgroup_id         dimension of grid for
2039                 _X)                               wavefront.
2040      then       Work-Group Id Y            1      32 bit work-group id in Y
2041                 (enable_sgpr_workgroup_id         dimension of grid for
2042                 _Y)                               wavefront.
2043      then       Work-Group Id Z            1      32 bit work-group id in Z
2044                 (enable_sgpr_workgroup_id         dimension of grid for
2045                 _Z)                               wavefront.
2046      then       Work-Group Info            1      {first_wave, 14’b0000,
2047                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2048                 _info)                            threadgroup_size_in_waves[5:0]}
2049      then       Scratch Wave Offset        1      32 bit byte offset from base
2050                 (enable_sgpr_private              of scratch base of queue
2051                 _segment_wave_offset)             executing the kernel
2052                                                   dispatch. Must be used as an
2053                                                   offset with Private
2054                                                   segment address when using
2055                                                   Scratch Segment Buffer. It
2056                                                   must be used to set up FLAT
2057                                                   SCRATCH for flat addressing
2058                                                   (see
2059                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2060      ========== ========================== ====== ==============================
2061
2062 The order of the VGPR registers is defined, but the compiler can specify which
2063 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2064 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2065 for enabled registers are dense starting at VGPR0: the first enabled register is
2066 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2067 VGPR number.
2068
2069 VGPR register initial state is defined in
2070 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2071
2072   .. table:: VGPR Register Set Up Order
2073      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2074
2075      ========== ========================== ====== ==============================
2076      VGPR Order Name                       Number Description
2077                 (kernel descriptor enable  of
2078                 field)                     VGPRs
2079      ========== ========================== ====== ==============================
2080      First      Work-Item Id X             1      32 bit work item id in X
2081                 (Always initialized)              dimension of work-group for
2082                                                   wavefront lane.
2083      then       Work-Item Id Y             1      32 bit work item id in Y
2084                 (enable_vgpr_workitem_id          dimension of work-group for
2085                 > 0)                              wavefront lane.
2086      then       Work-Item Id Z             1      32 bit work item id in Z
2087                 (enable_vgpr_workitem_id          dimension of work-group for
2088                 > 1)                              wavefront lane.
2089      ========== ========================== ====== ==============================
2090
2091 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2092
2093 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2094    registers.
2095 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2096    combination including none.
2097 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2098    cannot included with the flat scratch init value which is per queue.
2099 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2100    or (X, Y, Z).
2101
2102 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2103 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2104
2105 The global segment can be accessed either using buffer instructions (GFX6 which
2106 has V# 64 bit address support), flat instructions (GFX7-9), or global
2107 instructions (GFX9).
2108
2109 If buffer operations are used then the compiler can generate a V# with the
2110 following properties:
2111
2112 * base address of 0
2113 * no swizzle
2114 * ATC: 1 if IOMMU present (such as APU)
2115 * ptr64: 1
2116 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2117   APU and NC for dGPU).
2118
2119 .. _amdgpu-amdhsa-kernel-prolog:
2120
2121 Kernel Prolog
2122 ~~~~~~~~~~~~~
2123
2124 .. _amdgpu-amdhsa-m0:
2125
2126 M0
2127 ++
2128
2129 GFX6-GFX8
2130   The M0 register must be initialized with a value at least the total LDS size
2131   if the kernel may access LDS via DS or flat operations. Total LDS size is
2132   available in dispatch packet. For M0, it is also possible to use maximum
2133   possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2134   GFX7-GFX8).
2135 GFX9
2136   The M0 register is not used for range checking LDS accesses and so does not
2137   need to be initialized in the prolog.
2138
2139 .. _amdgpu-amdhsa-flat-scratch:
2140
2141 Flat Scratch
2142 ++++++++++++
2143
2144 If the kernel may use flat operations to access scratch memory, the prolog code
2145 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2146 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2147 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2148
2149 GFX6
2150   Flat scratch is not supported.
2151
2152 GFX7-8
2153   1. The low word of Flat Scratch Init is 32 bit byte offset from
2154      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2155      being managed by SPI for the queue executing the kernel dispatch. This is
2156      the same value used in the Scratch Segment Buffer V# base address. The
2157      prolog must add the value of Scratch Wave Offset to get the wave's byte
2158      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2159      FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2160      by 8 before moving into FLAT_SCRATCH_LO.
2161   2. The second word of Flat Scratch Init is 32 bit byte size of a single
2162      work-items scratch memory usage. This is directly loaded from the kernel
2163      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2164      DWORD. Having CP load it once avoids loading it at the beginning of every
2165      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2166      SIZE.
2167 GFX9
2168   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2169   memory being managed by SPI for the queue executing the kernel dispatch. The
2170   prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2171   pair for use as the flat scratch base in flat memory instructions.
2172
2173 .. _amdgpu-amdhsa-memory-model:
2174
2175 Memory Model
2176 ~~~~~~~~~~~~
2177
2178 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2179 (see :ref:`memmodel`). *The implementation is WIP.*
2180
2181 .. TODO
2182    Update when implementation complete.
2183
2184    Support more relaxed OpenCL memory model to be controlled by environment
2185    component of target triple.
2186
2187 The AMDGPU backend supports the memory synchronization scopes specified in
2188 :ref:`amdgpu-memory-scopes`.
2189
2190 The code sequences used to implement the memory model are defined in table
2191 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2192
2193 The sequences specify the order of instructions that a single thread must
2194 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2195 to other memory instructions executed by the same thread. This allows them to be
2196 moved earlier or later which can allow them to be combined with other instances
2197 of the same instruction, or hoisted/sunk out of loops to improve
2198 performance. Only the instructions related to the memory model are given;
2199 additional ``s_waitcnt`` instructions are required to ensure registers are
2200 defined before being used. These may be able to be combined with the memory
2201 model ``s_waitcnt`` instructions as described above.
2202
2203 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2204 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2205 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2206 memory model which has separate happens-before relations for the global and
2207 local address spaces, and only a fence specifying both global and local address
2208 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2209 allow an address space to be specified the OpenCL fence has to convervatively
2210 assume both local and global address space was specified. However, optimizations
2211 can often be done to eliminate the additional ``s_waitcnt``instructions when
2212 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2213 instructions. The code sequences in the table indicate what can be omitted for
2214 the OpenCL memory. The target triple environment is used to determine if the
2215 source language is OpenCL (see :ref:`amdgpu-opencl`).
2216
2217 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2218 operations.
2219
2220 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2221 termed vector memory operations.
2222
2223 For GFX6-GFX9:
2224
2225 * Each agent has multiple compute units (CU).
2226 * Each CU has multiple SIMDs that execute wavefronts.
2227 * The wavefronts for a single work-group are executed in the same CU but may be
2228   executed by different SIMDs.
2229 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2230   executing on it.
2231 * All LDS operations of a CU are performed as wavefront wide operations in a
2232   global order and involve no caching. Completion is reported to a wavefront in
2233   execution order.
2234 * The LDS memory has multiple request queues shared by the SIMDs of a
2235   CU. Therefore, the LDS operations performed by different waves of a work-group
2236   can be reordered relative to each other, which can result in reordering the
2237   visibility of vector memory operations with respect to LDS operations of other
2238   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2239   ensure synchronization between LDS operations and vector memory operations
2240   between waves of a work-group, but not between operations performed by the
2241   same wavefront.
2242 * The vector memory operations are performed as wavefront wide operations and
2243   completion is reported to a wavefront in execution order. The exception is
2244   that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2245   vector memory order if they access LDS memory, and out of LDS operation order
2246   if they access global memory.
2247 * The vector memory operations access a vector L1 cache shared by all wavefronts
2248   on a CU. Therefore, no special action is required for coherence between
2249   wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2250   coherence between waves executing in different work-groups as they may be
2251   executing on different CUs.
2252 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2253   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2254   scalar operations are used in a restricted way so do not impact the memory
2255   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2256 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2257   the same agent.
2258 * The L2 cache has independent channels to service disjoint ranges of virtual
2259   addresses.
2260 * Each CU has a separate request queue per channel. Therefore, the vector and
2261   scalar memory operations performed by waves executing in different work-groups
2262   (which may be executing on different CUs) of an agent can be reordered
2263   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2264   synchronization between vector memory operations of different CUs. It ensures a
2265   previous vector memory operation has completed before executing a subsequent
2266   vector memory or LDS operation and so can be used to meet the requirements of
2267   acquire and release.
2268 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2269   of virtual addresses can be set up to bypass it to ensure system coherence.
2270
2271 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2272 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2273 memory, atomic memory orderings are not meaningful and all accesses are treated
2274 as non-atomic.
2275
2276 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2277 scalar memory instructions). Since the constant address space contents do not
2278 change during the execution of a kernel dispatch it is not legal to perform
2279 stores, and atomic memory orderings are not meaningful and all access are
2280 treated as non-atomic.
2281
2282 A memory synchronization scope wider than work-group is not meaningful for the
2283 group (LDS) address space and is treated as work-group.
2284
2285 The memory model does not support the region address space which is treated as
2286 non-atomic.
2287
2288 Acquire memory ordering is not meaningful on store atomic instructions and is
2289 treated as non-atomic.
2290
2291 Release memory ordering is not meaningful on load atomic instructions and is
2292 treated a non-atomic.
2293
2294 Acquire-release memory ordering is not meaningful on load or store atomic
2295 instructions and is treated as acquire and release respectively.
2296
2297 AMDGPU backend only uses scalar memory operations to access memory that is
2298 proven to not change during the execution of the kernel dispatch. This includes
2299 constant address space and global address space for program scope const
2300 variables. Therefore the kernel machine code does not have to maintain the
2301 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2302 and vector L1 caches are invalidated between kernel dispatches by CP since
2303 constant address space data may change between kernel dispatch executions. See
2304 :ref:`amdgpu-amdhsa-memory-spaces`.
2305
2306 The one execption is if scalar writes are used to spill SGPR registers. In this
2307 case the AMDGPU backend ensures the memory location used to spill is never
2308 accessed by vector memory operations at the same time. If scalar writes are used
2309 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2310 return since the locations may be used for vector memory instructions by a
2311 future wave that uses the same scratch area, or a function call that creates a
2312 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2313 as all scalar writes are write-before-read in the same thread.
2314
2315 Scratch backing memory (which is used for the private address space) is accessed
2316 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2317 is only accessed by a single thread, and is always write-before-read,
2318 there is never a need to invalidate these entries from the L1 cache. Hence all
2319 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2320 lines.
2321
2322 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2323 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2324 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2325 coherent) and so the L2 cache will coherent with the CPU and other agents.
2326
2327   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2328      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2329
2330      ============ ============ ============== ========== =======================
2331      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2332                   Ordering     Sync Scope     Address
2333                                               Space
2334      ============ ============ ============== ========== =======================
2335      **Non-Atomic**
2336      ---------------------------------------------------------------------------
2337      load         *none*       *none*         - global   non-volatile
2338                                               - generic    1. buffer/global/flat_load
2339                                                          volatile
2340                                                            1. buffer/global/flat_load
2341                                                               glc=1
2342      load         *none*       *none*         - local    1. ds_load
2343      store        *none*       *none*         - global   1. buffer/global/flat_store
2344                                               - generic
2345      store        *none*       *none*         - local    1. ds_store
2346      **Unordered Atomic**
2347      ---------------------------------------------------------------------------
2348      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2349      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2350      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2351                                                          atomic*.
2352      **Monotonic Atomic**
2353      ---------------------------------------------------------------------------
2354      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2355                                - wavefront    - generic
2356                                - workgroup
2357      load atomic  monotonic    - singlethread - local    1. ds_load
2358                                - wavefront
2359                                - workgroup
2360      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2361                                - system       - generic     glc=1
2362      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2363                                - wavefront    - generic
2364                                - workgroup
2365                                - agent
2366                                - system
2367      store atomic monotonic    - singlethread - local    1. ds_store
2368                                - wavefront
2369                                - workgroup
2370      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2371                                - wavefront    - generic
2372                                - workgroup
2373                                - agent
2374                                - system
2375      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2376                                - wavefront
2377                                - workgroup
2378      **Acquire Atomic**
2379      ---------------------------------------------------------------------------
2380      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2381                                - wavefront    - local
2382                                               - generic
2383      load atomic  acquire      - workgroup    - global   1. buffer/global_load
2384      load atomic  acquire      - workgroup    - local    1. ds/flat_load
2385                                               - generic  2. s_waitcnt lgkmcnt(0)
2386
2387                                                            - If OpenCL, omit
2388                                                              waitcnt.
2389                                                            - Must happen before
2390                                                              any following
2391                                                              global/generic
2392                                                              load/load
2393                                                              atomic/store/store
2394                                                              atomic/atomicrmw.
2395                                                            - Ensures any
2396                                                              following global
2397                                                              data read is no
2398                                                              older than the load
2399                                                              atomic value being
2400                                                              acquired.
2401
2402      load atomic  acquire      - agent        - global   1. buffer/global_load
2403                                - system                     glc=1
2404                                                          2. s_waitcnt vmcnt(0)
2405
2406                                                            - Must happen before
2407                                                              following
2408                                                              buffer_wbinvl1_vol.
2409                                                            - Ensures the load
2410                                                              has completed
2411                                                              before invalidating
2412                                                              the cache.
2413
2414                                                          3. buffer_wbinvl1_vol
2415
2416                                                            - Must happen before
2417                                                              any following
2418                                                              global/generic
2419                                                              load/load
2420                                                              atomic/atomicrmw.
2421                                                            - Ensures that
2422                                                              following
2423                                                              loads will not see
2424                                                              stale global data.
2425
2426      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2427                                - system                  2. s_waitcnt vmcnt(0) &
2428                                                             lgkmcnt(0)
2429
2430                                                            - If OpenCL omit
2431                                                              lgkmcnt(0).
2432                                                            - Must happen before
2433                                                              following
2434                                                              buffer_wbinvl1_vol.
2435                                                            - Ensures the flat_load
2436                                                              has completed
2437                                                              before invalidating
2438                                                              the cache.
2439
2440                                                          3. buffer_wbinvl1_vol
2441
2442                                                            - Must happen before
2443                                                              any following
2444                                                              global/generic
2445                                                              load/load
2446                                                              atomic/atomicrmw.
2447                                                            - Ensures that
2448                                                              following loads
2449                                                              will not see stale
2450                                                              global data.
2451
2452      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2453                                - wavefront    - local
2454                                               - generic
2455      atomicrmw    acquire      - workgroup    - global   1. buffer/global_atomic
2456      atomicrmw    acquire      - workgroup    - local    1. ds/flat_atomic
2457                                               - generic  2. waitcnt lgkmcnt(0)
2458
2459                                                            - If OpenCL, omit
2460                                                              waitcnt.
2461                                                            - Must happen before
2462                                                              any following
2463                                                              global/generic
2464                                                              load/load
2465                                                              atomic/store/store
2466                                                              atomic/atomicrmw.
2467                                                            - Ensures any
2468                                                              following global
2469                                                              data read is no
2470                                                              older than the
2471                                                              atomicrmw value
2472                                                              being acquired.
2473
2474      atomicrmw    acquire      - agent        - global   1. buffer/global_atomic
2475                                - system                  2. s_waitcnt vmcnt(0)
2476
2477                                                            - Must happen before
2478                                                              following
2479                                                              buffer_wbinvl1_vol.
2480                                                            - Ensures the
2481                                                              atomicrmw has
2482                                                              completed before
2483                                                              invalidating the
2484                                                              cache.
2485
2486                                                          3. buffer_wbinvl1_vol
2487
2488                                                            - Must happen before
2489                                                              any following
2490                                                              global/generic
2491                                                              load/load
2492                                                              atomic/atomicrmw.
2493                                                            - Ensures that
2494                                                              following loads
2495                                                              will not see stale
2496                                                              global data.
2497
2498      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2499                                - system                  2. s_waitcnt vmcnt(0) &
2500                                                             lgkmcnt(0)
2501
2502                                                            - If OpenCL, omit
2503                                                              lgkmcnt(0).
2504                                                            - Must happen before
2505                                                              following
2506                                                              buffer_wbinvl1_vol.
2507                                                            - Ensures the
2508                                                              atomicrmw has
2509                                                              completed before
2510                                                              invalidating the
2511                                                              cache.
2512
2513                                                          3. buffer_wbinvl1_vol
2514
2515                                                            - Must happen before
2516                                                              any following
2517                                                              global/generic
2518                                                              load/load
2519                                                              atomic/atomicrmw.
2520                                                            - Ensures that
2521                                                              following loads
2522                                                              will not see stale
2523                                                              global data.
2524
2525      fence        acquire      - singlethread *none*     *none*
2526                                - wavefront
2527      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2528
2529                                                            - If OpenCL and
2530                                                              address space is
2531                                                              not generic, omit
2532                                                              waitcnt. However,
2533                                                              since LLVM
2534                                                              currently has no
2535                                                              address space on
2536                                                              the fence need to
2537                                                              conservatively
2538                                                              always generate. If
2539                                                              fence had an
2540                                                              address space then
2541                                                              set to address
2542                                                              space of OpenCL
2543                                                              fence flag, or to
2544                                                              generic if both
2545                                                              local and global
2546                                                              flags are
2547                                                              specified.
2548                                                            - Must happen after
2549                                                              any preceding
2550                                                              local/generic load
2551                                                              atomic/atomicrmw
2552                                                              with an equal or
2553                                                              wider sync scope
2554                                                              and memory ordering
2555                                                              stronger than
2556                                                              unordered (this is
2557                                                              termed the
2558                                                              fence-paired-atomic).
2559                                                            - Must happen before
2560                                                              any following
2561                                                              global/generic
2562                                                              load/load
2563                                                              atomic/store/store
2564                                                              atomic/atomicrmw.
2565                                                            - Ensures any
2566                                                              following global
2567                                                              data read is no
2568                                                              older than the
2569                                                              value read by the
2570                                                              fence-paired-atomic.
2571
2572      fence        acquire      - agent        *none*     1. s_waitcnt vmcnt(0) &
2573                                - system                     lgkmcnt(0)
2574
2575                                                            - If OpenCL and
2576                                                              address space is
2577                                                              not generic, omit
2578                                                              lgkmcnt(0).
2579                                                              However, since LLVM
2580                                                              currently has no
2581                                                              address space on
2582                                                              the fence need to
2583                                                              conservatively
2584                                                              always generate
2585                                                              (see comment for
2586                                                              previous fence).
2587                                                            - Could be split into
2588                                                              separate s_waitcnt
2589                                                              vmcnt(0) and
2590                                                              s_waitcnt
2591                                                              lgkmcnt(0) to allow
2592                                                              them to be
2593                                                              independently moved
2594                                                              according to the
2595                                                              following rules.
2596                                                            - s_waitcnt vmcnt(0)
2597                                                              must happen after
2598                                                              any preceding
2599                                                              global/generic load
2600                                                              atomic/atomicrmw
2601                                                              with an equal or
2602                                                              wider sync scope
2603                                                              and memory ordering
2604                                                              stronger than
2605                                                              unordered (this is
2606                                                              termed the
2607                                                              fence-paired-atomic).
2608                                                            - s_waitcnt lgkmcnt(0)
2609                                                              must happen after
2610                                                              any preceding
2611                                                              group/generic load
2612                                                              atomic/atomicrmw
2613                                                              with an equal or
2614                                                              wider sync scope
2615                                                              and memory ordering
2616                                                              stronger than
2617                                                              unordered (this is
2618                                                              termed the
2619                                                              fence-paired-atomic).
2620                                                            - Must happen before
2621                                                              the following
2622                                                              buffer_wbinvl1_vol.
2623                                                            - Ensures that the
2624                                                              fence-paired atomic
2625                                                              has completed
2626                                                              before invalidating
2627                                                              the
2628                                                              cache. Therefore
2629                                                              any following
2630                                                              locations read must
2631                                                              be no older than
2632                                                              the value read by
2633                                                              the
2634                                                              fence-paired-atomic.
2635
2636                                                          2. buffer_wbinvl1_vol
2637
2638                                                            - Must happen before
2639                                                              any following global/generic
2640                                                              load/load
2641                                                              atomic/store/store
2642                                                              atomic/atomicrmw.
2643                                                            - Ensures that
2644                                                              following loads
2645                                                              will not see stale
2646                                                              global data.
2647
2648      **Release Atomic**
2649      ---------------------------------------------------------------------------
2650      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2651                                - wavefront    - local
2652                                               - generic
2653      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2654                                               - generic
2655                                                            - If OpenCL, omit
2656                                                              waitcnt.
2657                                                            - Must happen after
2658                                                              any preceding
2659                                                              local/generic
2660                                                              load/store/load
2661                                                              atomic/store
2662                                                              atomic/atomicrmw.
2663                                                            - Must happen before
2664                                                              the following
2665                                                              store.
2666                                                            - Ensures that all
2667                                                              memory operations
2668                                                              to local have
2669                                                              completed before
2670                                                              performing the
2671                                                              store that is being
2672                                                              released.
2673
2674                                                          2. buffer/global/flat_store
2675      store atomic release      - workgroup    - local    1. ds_store
2676      store atomic release      - agent        - global   1. s_waitcnt vmcnt(0) &
2677                                - system       - generic     lgkmcnt(0)
2678
2679                                                            - If OpenCL, omit
2680                                                              lgkmcnt(0).
2681                                                            - Could be split into
2682                                                              separate s_waitcnt
2683                                                              vmcnt(0) and
2684                                                              s_waitcnt
2685                                                              lgkmcnt(0) to allow
2686                                                              them to be
2687                                                              independently moved
2688                                                              according to the
2689                                                              following rules.
2690                                                            - s_waitcnt vmcnt(0)
2691                                                              must happen after
2692                                                              any preceding
2693                                                              global/generic
2694                                                              load/store/load
2695                                                              atomic/store
2696                                                              atomic/atomicrmw.
2697                                                            - s_waitcnt lgkmcnt(0)
2698                                                              must happen after
2699                                                              any preceding
2700                                                              local/generic
2701                                                              load/store/load
2702                                                              atomic/store
2703                                                              atomic/atomicrmw.
2704                                                            - Must happen before
2705                                                              the following
2706                                                              store.
2707                                                            - Ensures that all
2708                                                              memory operations
2709                                                              to global have
2710                                                              completed before
2711                                                              performing the
2712                                                              store that is being
2713                                                              released.
2714
2715                                                          2. buffer/global/ds/flat_store
2716      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
2717                                - wavefront    - local
2718                                               - generic
2719      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2720                                               - generic
2721                                                            - If OpenCL, omit
2722                                                              waitcnt.
2723                                                            - Must happen after
2724                                                              any preceding
2725                                                              local/generic
2726                                                              load/store/load
2727                                                              atomic/store
2728                                                              atomic/atomicrmw.
2729                                                            - Must happen before
2730                                                              the following
2731                                                              atomicrmw.
2732                                                            - Ensures that all
2733                                                              memory operations
2734                                                              to local have
2735                                                              completed before
2736                                                              performing the
2737                                                              atomicrmw that is
2738                                                              being released.
2739
2740                                                          2. buffer/global/flat_atomic
2741      atomicrmw    release      - workgroup    - local    1. ds_atomic
2742      atomicrmw    release      - agent        - global   1. s_waitcnt vmcnt(0) &
2743                                - system       - generic     lgkmcnt(0)
2744
2745                                                            - If OpenCL, omit
2746                                                              lgkmcnt(0).
2747                                                            - Could be split into
2748                                                              separate s_waitcnt
2749                                                              vmcnt(0) and
2750                                                              s_waitcnt
2751                                                              lgkmcnt(0) to allow
2752                                                              them to be
2753                                                              independently moved
2754                                                              according to the
2755                                                              following rules.
2756                                                            - s_waitcnt vmcnt(0)
2757                                                              must happen after
2758                                                              any preceding
2759                                                              global/generic
2760                                                              load/store/load
2761                                                              atomic/store
2762                                                              atomic/atomicrmw.
2763                                                            - s_waitcnt lgkmcnt(0)
2764                                                              must happen after
2765                                                              any preceding
2766                                                              local/generic
2767                                                              load/store/load
2768                                                              atomic/store
2769                                                              atomic/atomicrmw.
2770                                                            - Must happen before
2771                                                              the following
2772                                                              atomicrmw.
2773                                                            - Ensures that all
2774                                                              memory operations
2775                                                              to global and local
2776                                                              have completed
2777                                                              before performing
2778                                                              the atomicrmw that
2779                                                              is being released.
2780
2781                                                          2. buffer/global/ds/flat_atomic*
2782      fence        release      - singlethread *none*     *none*
2783                                - wavefront
2784      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2785
2786                                                            - If OpenCL and
2787                                                              address space is
2788                                                              not generic, omit
2789                                                              waitcnt. However,
2790                                                              since LLVM
2791                                                              currently has no
2792                                                              address space on
2793                                                              the fence need to
2794                                                              conservatively
2795                                                              always generate
2796                                                              (see comment for
2797                                                              previous fence).
2798                                                            - Must happen after
2799                                                              any preceding
2800                                                              local/generic
2801                                                              load/load
2802                                                              atomic/store/store
2803                                                              atomic/atomicrmw.
2804                                                            - Must happen before
2805                                                              any following store
2806                                                              atomic/atomicrmw
2807                                                              with an equal or
2808                                                              wider sync scope
2809                                                              and memory ordering
2810                                                              stronger than
2811                                                              unordered (this is
2812                                                              termed the
2813                                                              fence-paired-atomic).
2814                                                            - Ensures that all
2815                                                              memory operations
2816                                                              to local have
2817                                                              completed before
2818                                                              performing the
2819                                                              following
2820                                                              fence-paired-atomic.
2821
2822      fence        release      - agent        *none*     1. s_waitcnt vmcnt(0) &
2823                                - system                     lgkmcnt(0)
2824
2825                                                            - If OpenCL and
2826                                                              address space is
2827                                                              not generic, omit
2828                                                              lgkmcnt(0).
2829                                                              However, since LLVM
2830                                                              currently has no
2831                                                              address space on
2832                                                              the fence need to
2833                                                              conservatively
2834                                                              always generate
2835                                                              (see comment for
2836                                                              previous fence).
2837                                                            - Could be split into
2838                                                              separate s_waitcnt
2839                                                              vmcnt(0) and
2840                                                              s_waitcnt
2841                                                              lgkmcnt(0) to allow
2842                                                              them to be
2843                                                              independently moved
2844                                                              according to the
2845                                                              following rules.
2846                                                            - s_waitcnt vmcnt(0)
2847                                                              must happen after
2848                                                              any preceding
2849                                                              global/generic
2850                                                              load/store/load
2851                                                              atomic/store
2852                                                              atomic/atomicrmw.
2853                                                            - s_waitcnt lgkmcnt(0)
2854                                                              must happen after
2855                                                              any preceding
2856                                                              local/generic
2857                                                              load/store/load
2858                                                              atomic/store
2859                                                              atomic/atomicrmw.
2860                                                            - Must happen before
2861                                                              any following store
2862                                                              atomic/atomicrmw
2863                                                              with an equal or
2864                                                              wider sync scope
2865                                                              and memory ordering
2866                                                              stronger than
2867                                                              unordered (this is
2868                                                              termed the
2869                                                              fence-paired-atomic).
2870                                                            - Ensures that all
2871                                                              memory operations
2872                                                              to global have
2873                                                              completed before
2874                                                              performing the
2875                                                              following
2876                                                              fence-paired-atomic.
2877
2878      **Acquire-Release Atomic**
2879      ---------------------------------------------------------------------------
2880      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
2881                                - wavefront    - local
2882                                               - generic
2883      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2884
2885                                                            - If OpenCL, omit
2886                                                              waitcnt.
2887                                                            - Must happen after
2888                                                              any preceding
2889                                                              local/generic
2890                                                              load/store/load
2891                                                              atomic/store
2892                                                              atomic/atomicrmw.
2893                                                            - Must happen before
2894                                                              the following
2895                                                              atomicrmw.
2896                                                            - Ensures that all
2897                                                              memory operations
2898                                                              to local have
2899                                                              completed before
2900                                                              performing the
2901                                                              atomicrmw that is
2902                                                              being released.
2903
2904                                                          2. buffer/global_atomic
2905      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
2906                                                          2. s_waitcnt lgkmcnt(0)
2907
2908                                                            - If OpenCL, omit
2909                                                              waitcnt.
2910                                                            - Must happen before
2911                                                              any following
2912                                                              global/generic
2913                                                              load/load
2914                                                              atomic/store/store
2915                                                              atomic/atomicrmw.
2916                                                            - Ensures any
2917                                                              following global
2918                                                              data read is no
2919                                                              older than the load
2920                                                              atomic value being
2921                                                              acquired.
2922
2923      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2924
2925                                                            - If OpenCL, omit
2926                                                              waitcnt.
2927                                                            - Must happen after
2928                                                              any preceding
2929                                                              local/generic
2930                                                              load/store/load
2931                                                              atomic/store
2932                                                              atomic/atomicrmw.
2933                                                            - Must happen before
2934                                                              the following
2935                                                              atomicrmw.
2936                                                            - Ensures that all
2937                                                              memory operations
2938                                                              to local have
2939                                                              completed before
2940                                                              performing the
2941                                                              atomicrmw that is
2942                                                              being released.
2943
2944                                                          2. flat_atomic
2945                                                          3. s_waitcnt lgkmcnt(0)
2946
2947                                                            - If OpenCL, omit
2948                                                              waitcnt.
2949                                                            - Must happen before
2950                                                              any following
2951                                                              global/generic
2952                                                              load/load
2953                                                              atomic/store/store
2954                                                              atomic/atomicrmw.
2955                                                            - Ensures any
2956                                                              following global
2957                                                              data read is no
2958                                                              older than the load
2959                                                              atomic value being
2960                                                              acquired.
2961      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt vmcnt(0) &
2962                                - system                     lgkmcnt(0)
2963
2964                                                            - If OpenCL, omit
2965                                                              lgkmcnt(0).
2966                                                            - Could be split into
2967                                                              separate s_waitcnt
2968                                                              vmcnt(0) and
2969                                                              s_waitcnt
2970                                                              lgkmcnt(0) to allow
2971                                                              them to be
2972                                                              independently moved
2973                                                              according to the
2974                                                              following rules.
2975                                                            - s_waitcnt vmcnt(0)
2976                                                              must happen after
2977                                                              any preceding
2978                                                              global/generic
2979                                                              load/store/load
2980                                                              atomic/store
2981                                                              atomic/atomicrmw.
2982                                                            - s_waitcnt lgkmcnt(0)
2983                                                              must happen after
2984                                                              any preceding
2985                                                              local/generic
2986                                                              load/store/load
2987                                                              atomic/store
2988                                                              atomic/atomicrmw.
2989                                                            - Must happen before
2990                                                              the following
2991                                                              atomicrmw.
2992                                                            - Ensures that all
2993                                                              memory operations
2994                                                              to global have
2995                                                              completed before
2996                                                              performing the
2997                                                              atomicrmw that is
2998                                                              being released.
2999
3000                                                          2. buffer/global_atomic
3001                                                          3. s_waitcnt vmcnt(0)
3002
3003                                                            - Must happen before
3004                                                              following
3005                                                              buffer_wbinvl1_vol.
3006                                                            - Ensures the
3007                                                              atomicrmw has
3008                                                              completed before
3009                                                              invalidating the
3010                                                              cache.
3011
3012                                                          4. buffer_wbinvl1_vol
3013
3014                                                            - Must happen before
3015                                                              any following
3016                                                              global/generic
3017                                                              load/load
3018                                                              atomic/atomicrmw.
3019                                                            - Ensures that
3020                                                              following loads
3021                                                              will not see stale
3022                                                              global data.
3023
3024      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt vmcnt(0) &
3025                                - system                     lgkmcnt(0)
3026
3027                                                            - If OpenCL, omit
3028                                                              lgkmcnt(0).
3029                                                            - Could be split into
3030                                                              separate s_waitcnt
3031                                                              vmcnt(0) and
3032                                                              s_waitcnt
3033                                                              lgkmcnt(0) to allow
3034                                                              them to be
3035                                                              independently moved
3036                                                              according to the
3037                                                              following rules.
3038                                                            - s_waitcnt vmcnt(0)
3039                                                              must happen after
3040                                                              any preceding
3041                                                              global/generic
3042                                                              load/store/load
3043                                                              atomic/store
3044                                                              atomic/atomicrmw.
3045                                                            - s_waitcnt lgkmcnt(0)
3046                                                              must happen after
3047                                                              any preceding
3048                                                              local/generic
3049                                                              load/store/load
3050                                                              atomic/store
3051                                                              atomic/atomicrmw.
3052                                                            - Must happen before
3053                                                              the following
3054                                                              atomicrmw.
3055                                                            - Ensures that all
3056                                                              memory operations
3057                                                              to global have
3058                                                              completed before
3059                                                              performing the
3060                                                              atomicrmw that is
3061                                                              being released.
3062
3063                                                          2. flat_atomic
3064                                                          3. s_waitcnt vmcnt(0) &
3065                                                             lgkmcnt(0)
3066
3067                                                            - If OpenCL, omit
3068                                                              lgkmcnt(0).
3069                                                            - Must happen before
3070                                                              following
3071                                                              buffer_wbinvl1_vol.
3072                                                            - Ensures the
3073                                                              atomicrmw has
3074                                                              completed before
3075                                                              invalidating the
3076                                                              cache.
3077
3078                                                          4. buffer_wbinvl1_vol
3079
3080                                                            - Must happen before
3081                                                              any following
3082                                                              global/generic
3083                                                              load/load
3084                                                              atomic/atomicrmw.
3085                                                            - Ensures that
3086                                                              following loads
3087                                                              will not see stale
3088                                                              global data.
3089
3090      fence        acq_rel      - singlethread *none*     *none*
3091                                - wavefront
3092      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3093
3094                                                            - If OpenCL and
3095                                                              address space is
3096                                                              not generic, omit
3097                                                              waitcnt. However,
3098                                                              since LLVM
3099                                                              currently has no
3100                                                              address space on
3101                                                              the fence need to
3102                                                              conservatively
3103                                                              always generate
3104                                                              (see comment for
3105                                                              previous fence).
3106                                                            - Must happen after
3107                                                              any preceding
3108                                                              local/generic
3109                                                              load/load
3110                                                              atomic/store/store
3111                                                              atomic/atomicrmw.
3112                                                            - Must happen before
3113                                                              any following
3114                                                              global/generic
3115                                                              load/load
3116                                                              atomic/store/store
3117                                                              atomic/atomicrmw.
3118                                                            - Ensures that all
3119                                                              memory operations
3120                                                              to local have
3121                                                              completed before
3122                                                              performing any
3123                                                              following global
3124                                                              memory operations.
3125                                                            - Ensures that the
3126                                                              preceding
3127                                                              local/generic load
3128                                                              atomic/atomicrmw
3129                                                              with an equal or
3130                                                              wider sync scope
3131                                                              and memory ordering
3132                                                              stronger than
3133                                                              unordered (this is
3134                                                              termed the
3135                                                              fence-paired-atomic)
3136                                                              has completed
3137                                                              before following
3138                                                              global memory
3139                                                              operations. This
3140                                                              satisfies the
3141                                                              requirements of
3142                                                              acquire.
3143                                                            - Ensures that all
3144                                                              previous memory
3145                                                              operations have
3146                                                              completed before a
3147                                                              following
3148                                                              local/generic store
3149                                                              atomic/atomicrmw
3150                                                              with an equal or
3151                                                              wider sync scope
3152                                                              and memory ordering
3153                                                              stronger than
3154                                                              unordered (this is
3155                                                              termed the
3156                                                              fence-paired-atomic).
3157                                                              This satisfies the
3158                                                              requirements of
3159                                                              release.
3160
3161      fence        acq_rel      - agent        *none*     1. s_waitcnt vmcnt(0) &
3162                                - system                     lgkmcnt(0)
3163
3164                                                            - If OpenCL and
3165                                                              address space is
3166                                                              not generic, omit
3167                                                              lgkmcnt(0).
3168                                                              However, since LLVM
3169                                                              currently has no
3170                                                              address space on
3171                                                              the fence need to
3172                                                              conservatively
3173                                                              always generate
3174                                                              (see comment for
3175                                                              previous fence).
3176                                                            - Could be split into
3177                                                              separate s_waitcnt
3178                                                              vmcnt(0) and
3179                                                              s_waitcnt
3180                                                              lgkmcnt(0) to allow
3181                                                              them to be
3182                                                              independently moved
3183                                                              according to the
3184                                                              following rules.
3185                                                            - s_waitcnt vmcnt(0)
3186                                                              must happen after
3187                                                              any preceding
3188                                                              global/generic
3189                                                              load/store/load
3190                                                              atomic/store
3191                                                              atomic/atomicrmw.
3192                                                            - s_waitcnt lgkmcnt(0)
3193                                                              must happen after
3194                                                              any preceding
3195                                                              local/generic
3196                                                              load/store/load
3197                                                              atomic/store
3198                                                              atomic/atomicrmw.
3199                                                            - Must happen before
3200                                                              the following
3201                                                              buffer_wbinvl1_vol.
3202                                                            - Ensures that the
3203                                                              preceding
3204                                                              global/local/generic
3205                                                              load
3206                                                              atomic/atomicrmw
3207                                                              with an equal or
3208                                                              wider sync scope
3209                                                              and memory ordering
3210                                                              stronger than
3211                                                              unordered (this is
3212                                                              termed the
3213                                                              fence-paired-atomic)
3214                                                              has completed
3215                                                              before invalidating
3216                                                              the cache. This
3217                                                              satisfies the
3218                                                              requirements of
3219                                                              acquire.
3220                                                            - Ensures that all
3221                                                              previous memory
3222                                                              operations have
3223                                                              completed before a
3224                                                              following
3225                                                              global/local/generic
3226                                                              store
3227                                                              atomic/atomicrmw
3228                                                              with an equal or
3229                                                              wider sync scope
3230                                                              and memory ordering
3231                                                              stronger than
3232                                                              unordered (this is
3233                                                              termed the
3234                                                              fence-paired-atomic).
3235                                                              This satisfies the
3236                                                              requirements of
3237                                                              release.
3238
3239                                                          2. buffer_wbinvl1_vol
3240
3241                                                            - Must happen before
3242                                                              any following
3243                                                              global/generic
3244                                                              load/load
3245                                                              atomic/store/store
3246                                                              atomic/atomicrmw.
3247                                                            - Ensures that
3248                                                              following loads
3249                                                              will not see stale
3250                                                              global data. This
3251                                                              satisfies the
3252                                                              requirements of
3253                                                              acquire.
3254
3255      **Sequential Consistent Atomic**
3256      ---------------------------------------------------------------------------
3257      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3258                                - wavefront    - local    load atomic acquire*.
3259                                - workgroup    - generic
3260      load atomic  seq_cst      - agent        - global   1. s_waitcnt vmcnt(0)
3261                                - system       - local
3262                                               - generic    - Must happen after
3263                                                              preceding
3264                                                              global/generic load
3265                                                              atomic/store
3266                                                              atomic/atomicrmw
3267                                                              with memory
3268                                                              ordering of seq_cst
3269                                                              and with equal or
3270                                                              wider sync scope.
3271                                                              (Note that seq_cst
3272                                                              fences have their
3273                                                              own s_waitcnt
3274                                                              vmcnt(0) and so do
3275                                                              not need to be
3276                                                              considered.)
3277                                                            - Ensures any
3278                                                              preceding
3279                                                              sequential
3280                                                              consistent global
3281                                                              memory instructions
3282                                                              have completed
3283                                                              before executing
3284                                                              this sequentially
3285                                                              consistent
3286                                                              instruction. This
3287                                                              prevents reordering
3288                                                              a seq_cst store
3289                                                              followed by a
3290                                                              seq_cst load (Note
3291                                                              that seq_cst is
3292                                                              stronger than
3293                                                              acquire/release as
3294                                                              the reordering of
3295                                                              load acquire
3296                                                              followed by a store
3297                                                              release is
3298                                                              prevented by the
3299                                                              waitcnt vmcnt(0) of
3300                                                              the release, but
3301                                                              there is nothing
3302                                                              preventing a store
3303                                                              release followed by
3304                                                              load acquire from
3305                                                              competing out of
3306                                                              order.)
3307
3308                                                          2. *Following
3309                                                             instructions same as
3310                                                             corresponding load
3311                                                             atomic acquire*.
3312
3313      store atomic seq_cst      - singlethread - global   *Same as corresponding
3314                                - wavefront    - local    store atomic release*.
3315                                - workgroup    - generic
3316      store atomic seq_cst      - agent        - global   *Same as corresponding
3317                                - system       - generic  store atomic release*.
3318      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3319                                - wavefront    - local    atomicrmw acq_rel*.
3320                                - workgroup    - generic
3321      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3322                                - system       - generic  atomicrmw acq_rel*.
3323      fence        seq_cst      - singlethread *none*     *Same as corresponding
3324                                - wavefront               fence acq_rel*.
3325                                - workgroup
3326                                - agent
3327                                - system
3328      ============ ============ ============== ========== =======================
3329
3330 The memory order also adds the single thread optimization constrains defined in
3331 table
3332 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3333
3334   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3335      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3336
3337      ============ ==============================================================
3338      LLVM Memory  Optimization Constraints
3339      Ordering
3340      ============ ==============================================================
3341      unordered    *none*
3342      monotonic    *none*
3343      acquire      - If a load atomic/atomicrmw then no following load/load
3344                     atomic/store/ store atomic/atomicrmw/fence instruction can
3345                     be moved before the acquire.
3346                   - If a fence then same as load atomic, plus no preceding
3347                     associated fence-paired-atomic can be moved after the fence.
3348      release      - If a store atomic/atomicrmw then no preceding load/load
3349                     atomic/store/ store atomic/atomicrmw/fence instruction can
3350                     be moved after the release.
3351                   - If a fence then same as store atomic, plus no following
3352                     associated fence-paired-atomic can be moved before the
3353                     fence.
3354      acq_rel      Same constraints as both acquire and release.
3355      seq_cst      - If a load atomic then same constraints as acquire, plus no
3356                     preceding sequentially consistent load atomic/store
3357                     atomic/atomicrmw/fence instruction can be moved after the
3358                     seq_cst.
3359                   - If a store atomic then the same constraints as release, plus
3360                     no following sequentially consistent load atomic/store
3361                     atomic/atomicrmw/fence instruction can be moved before the
3362                     seq_cst.
3363                   - If an atomicrmw/fence then same constraints as acq_rel.
3364      ============ ==============================================================
3365
3366 Trap Handler ABI
3367 ~~~~~~~~~~~~~~~~
3368
3369 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3370 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3371 the ``s_trap`` instruction with the following usage:
3372
3373   .. table:: AMDGPU Trap Handler for AMDHSA OS
3374      :name: amdgpu-trap-handler-for-amdhsa-os-table
3375
3376      =================== =============== =============== =======================
3377      Usage               Code Sequence   Trap Handler    Description
3378                                          Inputs
3379      =================== =============== =============== =======================
3380      reserved            ``s_trap 0x00``                 Reserved by hardware.
3381      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3382                                            ``queue_ptr`` ``debugtrap``
3383                                          ``VGPR0``:      intrinsic (not
3384                                            ``arg``       implemented).
3385      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3386                                            ``queue_ptr`` terminated and its
3387                                                          associated queue put
3388                                                          into the error state.
3389      ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
3390                                            ``queue_ptr`` installed handled
3391                                                          same as ``llvm.trap``.
3392      debugger breakpoint ``s_trap 0x07``                 Reserved for  debugger
3393                                                          breakpoints.
3394      debugger            ``s_trap 0x08``                 Reserved for debugger.
3395      debugger            ``s_trap 0xfe``                 Reserved for debugger.
3396      debugger            ``s_trap 0xff``                 Reserved for debugger.
3397      =================== =============== =============== =======================
3398
3399 Unspecified OS
3400 --------------
3401
3402 This section provides code conventions used when the target triple OS is
3403 empty (see :ref:`amdgpu-target-triples`).
3404
3405 Trap Handler ABI
3406 ~~~~~~~~~~~~~~~~
3407
3408 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3409 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3410 instructions are handled as follows:
3411
3412   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3413      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3414
3415      =============== =============== ===========================================
3416      Usage           Code Sequence   Description
3417      =============== =============== ===========================================
3418      llvm.trap       s_endpgm        Causes wavefront to be terminated.
3419      llvm.debugtrap  *none*          Compiler warning given that there is no
3420                                      trap handler installed.
3421      =============== =============== ===========================================
3422
3423 Source Languages
3424 ================
3425
3426 .. _amdgpu-opencl:
3427
3428 OpenCL
3429 ------
3430
3431 When generating code for the OpenCL language the target triple environment
3432 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3433
3434 When the language is OpenCL the following differences occur:
3435
3436 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3437 2. The AMDGPU backend adds additional arguments to the kernel.
3438 3. Additional metadata is generated
3439    (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3440
3441 .. TODO
3442    Specify what affect this has. Hidden arguments added. Additional metadata
3443    generated.
3444
3445 .. _amdgpu-hcc:
3446
3447 HCC
3448 ---
3449
3450 When generating code for the OpenCL language the target triple environment
3451 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3452
3453 When the language is OpenCL the following differences occur:
3454
3455 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3456
3457 .. TODO
3458    Specify what affect this has.
3459
3460 Assembler
3461 ---------
3462
3463 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3464 It supports AMDGCN GFX6-GFX8.
3465
3466 This section describes general syntax for instructions and operands. For more
3467 information about instructions, their semantics and supported combinations of
3468 operands, refer to one of instruction set architecture manuals
3469 [AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
3470 [AMD-Vega]_.
3471
3472 An instruction has the following syntax (register operands are normally
3473 comma-separated while extra operands are space-separated):
3474
3475 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3476
3477 Operands
3478 ~~~~~~~~
3479
3480 The following syntax for register operands is supported:
3481
3482 * SGPR registers: s0, ... or s[0], ...
3483 * VGPR registers: v0, ... or v[0], ...
3484 * TTMP registers: ttmp0, ... or ttmp[0], ...
3485 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3486 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3487 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3488 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3489 * Register index expressions: v[2*2], s[1-1:2-1]
3490 * 'off' indicates that an operand is not enabled
3491
3492 The following extra operands are supported:
3493
3494 * offset, offset0, offset1
3495 * idxen, offen bits
3496 * glc, slc, tfe bits
3497 * waitcnt: integer or combination of counter values
3498 * VOP3 modifiers:
3499
3500   - abs (\| \|), neg (\-)
3501
3502 * DPP modifiers:
3503
3504   - row_shl, row_shr, row_ror, row_rol
3505   - row_mirror, row_half_mirror, row_bcast
3506   - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3507   - row_mask, bank_mask, bound_ctrl
3508
3509 * SDWA modifiers:
3510
3511   - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3512   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3513   - abs, neg, sext
3514
3515 Instruction Examples
3516 ~~~~~~~~~~~~~~~~~~~~
3517
3518 DS
3519 ~~
3520
3521 .. code-block:: nasm
3522
3523   ds_add_u32 v2, v4 offset:16
3524   ds_write_src2_b64 v2 offset0:4 offset1:8
3525   ds_cmpst_f32 v2, v4, v6
3526   ds_min_rtn_f64 v[8:9], v2, v[4:5]
3527
3528
3529 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3530
3531 FLAT
3532 ++++
3533
3534 .. code-block:: nasm
3535
3536   flat_load_dword v1, v[3:4]
3537   flat_store_dwordx3 v[3:4], v[5:7]
3538   flat_atomic_swap v1, v[3:4], v5 glc
3539   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3540   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3541
3542 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3543
3544 MUBUF
3545 +++++
3546
3547 .. code-block:: nasm
3548
3549   buffer_load_dword v1, off, s[4:7], s1
3550   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3551   buffer_store_format_xy v[1:2], off, s[4:7], s1
3552   buffer_wbinvl1
3553   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3554
3555 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3556
3557 SMRD/SMEM
3558 +++++++++
3559
3560 .. code-block:: nasm
3561
3562   s_load_dword s1, s[2:3], 0xfc
3563   s_load_dwordx8 s[8:15], s[2:3], s4
3564   s_load_dwordx16 s[88:103], s[2:3], s4
3565   s_dcache_inv_vol
3566   s_memtime s[4:5]
3567
3568 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3569
3570 SOP1
3571 ++++
3572
3573 .. code-block:: nasm
3574
3575   s_mov_b32 s1, s2
3576   s_mov_b64 s[0:1], 0x80000000
3577   s_cmov_b32 s1, 200
3578   s_wqm_b64 s[2:3], s[4:5]
3579   s_bcnt0_i32_b64 s1, s[2:3]
3580   s_swappc_b64 s[2:3], s[4:5]
3581   s_cbranch_join s[4:5]
3582
3583 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3584
3585 SOP2
3586 ++++
3587
3588 .. code-block:: nasm
3589
3590   s_add_u32 s1, s2, s3
3591   s_and_b64 s[2:3], s[4:5], s[6:7]
3592   s_cselect_b32 s1, s2, s3
3593   s_andn2_b32 s2, s4, s6
3594   s_lshr_b64 s[2:3], s[4:5], s6
3595   s_ashr_i32 s2, s4, s6
3596   s_bfm_b64 s[2:3], s4, s6
3597   s_bfe_i64 s[2:3], s[4:5], s6
3598   s_cbranch_g_fork s[4:5], s[6:7]
3599
3600 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3601
3602 SOPC
3603 ++++
3604
3605 .. code-block:: nasm
3606
3607   s_cmp_eq_i32 s1, s2
3608   s_bitcmp1_b32 s1, s2
3609   s_bitcmp0_b64 s[2:3], s4
3610   s_setvskip s3, s5
3611
3612 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3613
3614 SOPP
3615 ++++
3616
3617 .. code-block:: nasm
3618
3619   s_barrier
3620   s_nop 2
3621   s_endpgm
3622   s_waitcnt 0 ; Wait for all counters to be 0
3623   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3624   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3625   s_sethalt 9
3626   s_sleep 10
3627   s_sendmsg 0x1
3628   s_sendmsg sendmsg(MSG_INTERRUPT)
3629   s_trap 1
3630
3631 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3632
3633 Unless otherwise mentioned, little verification is performed on the operands
3634 of SOPP Instructions, so it is up to the programmer to be familiar with the
3635 range or acceptable values.
3636
3637 VALU
3638 ++++
3639
3640 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3641 the assembler will automatically use optimal encoding based on its operands.
3642 To force specific encoding, one can add a suffix to the opcode of the instruction:
3643
3644 * _e32 for 32-bit VOP1/VOP2/VOPC
3645 * _e64 for 64-bit VOP3
3646 * _dpp for VOP_DPP
3647 * _sdwa for VOP_SDWA
3648
3649 VOP1/VOP2/VOP3/VOPC examples:
3650
3651 .. code-block:: nasm
3652
3653   v_mov_b32 v1, v2
3654   v_mov_b32_e32 v1, v2
3655   v_nop
3656   v_cvt_f64_i32_e32 v[1:2], v2
3657   v_floor_f32_e32 v1, v2
3658   v_bfrev_b32_e32 v1, v2
3659   v_add_f32_e32 v1, v2, v3
3660   v_mul_i32_i24_e64 v1, v2, 3
3661   v_mul_i32_i24_e32 v1, -3, v3
3662   v_mul_i32_i24_e32 v1, -100, v3
3663   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3664   v_max_f16_e32 v1, v2, v3
3665
3666 VOP_DPP examples:
3667
3668 .. code-block:: nasm
3669
3670   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3671   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3672   v_mov_b32 v0, v0 wave_shl:1
3673   v_mov_b32 v0, v0 row_mirror
3674   v_mov_b32 v0, v0 row_bcast:31
3675   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3676   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3677   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3678
3679 VOP_SDWA examples:
3680
3681 .. code-block:: nasm
3682
3683   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3684   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3685   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3686   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3687   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3688
3689 For full list of supported instructions, refer to "Vector ALU instructions".
3690
3691 HSA Code Object Directives
3692 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3693
3694 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3695 one can specify them with assembler directives.
3696
3697 .hsa_code_object_version major, minor
3698 +++++++++++++++++++++++++++++++++++++
3699
3700 *major* and *minor* are integers that specify the version of the HSA code
3701 object that will be generated by the assembler.
3702
3703 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3704 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3705
3706
3707 *major*, *minor*, and *stepping* are all integers that describe the instruction
3708 set architecture (ISA) version of the assembly program.
3709
3710 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
3711 "AMD" and *arch* should always be equal to "AMDGPU".
3712
3713 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3714 from the value of the -mcpu option that is passed to the assembler.
3715
3716 .amdgpu_hsa_kernel (name)
3717 +++++++++++++++++++++++++
3718
3719 This directives specifies that the symbol with given name is a kernel entry point
3720 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3721
3722 .amd_kernel_code_t
3723 ++++++++++++++++++
3724
3725 This directive marks the beginning of a list of key / value pairs that are used
3726 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3727 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
3728 any amd_kernel_code_t values that are unspecified a default value will be
3729 used.  The default value for all keys is 0, with the following exceptions:
3730
3731 - *kernel_code_version_major* defaults to 1.
3732 - *machine_kind* defaults to 1.
3733 - *machine_version_major*, *machine_version_minor*, and
3734   *machine_version_stepping* are derived from the value of the -mcpu option
3735   that is passed to the assembler.
3736 - *kernel_code_entry_byte_offset* defaults to 256.
3737 - *wavefront_size* defaults to 6.
3738 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3739   *private_segment_alignment* default to 4.  Note that alignments are specified
3740   as a power of two, so a value of **n** means an alignment of 2^ **n**.
3741
3742 The *.amd_kernel_code_t* directive must be placed immediately after the
3743 function label and before any instructions.
3744
3745 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3746 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
3747
3748 Here is an example of a minimal amd_kernel_code_t specification:
3749
3750 .. code-block:: none
3751
3752    .hsa_code_object_version 1,0
3753    .hsa_code_object_isa
3754
3755    .hsatext
3756    .globl  hello_world
3757    .p2align 8
3758    .amdgpu_hsa_kernel hello_world
3759
3760    hello_world:
3761
3762       .amd_kernel_code_t
3763          enable_sgpr_kernarg_segment_ptr = 1
3764          is_ptr64 = 1
3765          compute_pgm_rsrc1_vgprs = 0
3766          compute_pgm_rsrc1_sgprs = 0
3767          compute_pgm_rsrc2_user_sgpr = 2
3768          kernarg_segment_byte_size = 8
3769          wavefront_sgpr_count = 2
3770          workitem_vgpr_count = 3
3771      .end_amd_kernel_code_t
3772
3773      s_load_dwordx2 s[0:1], s[0:1] 0x0
3774      v_mov_b32 v0, 3.14159
3775      s_waitcnt lgkmcnt(0)
3776      v_mov_b32 v1, s0
3777      v_mov_b32 v2, s1
3778      flat_store_dword v[1:2], v0
3779      s_endpgm
3780    .Lfunc_end0:
3781         .size   hello_world, .Lfunc_end0-hello_world
3782
3783 Additional Documentation
3784 ========================
3785
3786 .. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3787 .. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3788 .. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3789 .. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3790 .. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3791 .. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3792 .. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3793 .. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
3794 .. [AMD-OpenCL_Programming-Guide]  `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3795 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3796 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3797 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3798 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3799 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3800 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3801 .. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3802 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3803 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3804 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__