=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os-table

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as AMD's ROCm [AMD-ROCm]_.
     ``amdpal``     Graphics shaders and compute kernels executed on AMD PAL
                    runtime.
     ``mesa3d``     Graphics shaders and compute kernels executed on Mesa 3D
                    runtime.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

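As an illustration of how the tables above combine, the following minimal
sketch assembles a triple string from its components. The helper name
``make_target_triple`` and its parameters are hypothetical and are not part of
LLVM or Clang; the accepted values are simply those listed in the tables, and
dropping trailing empty components is an assumption of this sketch.

```python
# Component values copied from the architecture, vendor, and OS tables.
ARCHITECTURES = ("r600", "amdgcn")
VENDORS = ("amd", "mesa3d")
OPERATING_SYSTEMS = ("", "amdhsa", "amdpal", "mesa3d")  # "" selects the unknown OS


def make_target_triple(arch, vendor="amd", os_name="", environment=""):
    """Hypothetical helper: join triple components for ``clang -target``."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unsupported architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unsupported vendor: {vendor!r}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unsupported OS: {os_name!r}")
    # The vendor table lists mesa3d only for the mesa3d OS.
    if vendor == "mesa3d" and os_name != "mesa3d":
        raise ValueError("the mesa3d vendor is listed only for the mesa3d OS")
    # Drop trailing empty components so no dangling '-' is produced.
    return "-".join([arch, vendor, os_name, environment]).rstrip("-")


print(make_target_triple("amdgcn", "amd", "amdhsa"))
```

The resulting string would be passed as the argument of ``clang -target``.
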
.. _amdgpu-processors:

Processors
----------

Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor.
Names from either the *Processor* or the *Alternative Processor* column can be
used.

  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ========= ======= ==================
     Processor   Alternative     Target       dGPU/ Target    ROCm    Example
                 Processor       Triple       APU   Features  Support Products
                                 Architecture       Supported
                                                    [Default]
     =========== =============== ============ ===== ========= ======= ==================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU
     ``r630``                    ``r600``     dGPU
     ``rs880``                   ``r600``     dGPU
     ``rv670``                   ``r600``     dGPU
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU
     ``rv730``                   ``r600``     dGPU
     ``rv770``                   ``r600``     dGPU
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU
     ``cypress``                 ``r600``     dGPU
     ``juniper``                 ``r600``     dGPU
     ``redwood``                 ``r600``     dGPU
     ``sumo``                    ``r600``     dGPU
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU
     ``caicos``                  ``r600``     dGPU
     ``cayman``                  ``r600``     dGPU
     ``turks``                   ``r600``     dGPU
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
     ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
                 - ``oland``
                 - ``pitcairn``
                 - ``verde``
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
                                                                      - A6 Pro-7050B
                                                                      - A8-7100
                                                                      - A8 Pro-7150B
                                                                      - A10-7300
                                                                      - A10 Pro-7350B
                                                                      - FX-7500
                                                                      - A8-7200P
                                                                      - A10-7400P
                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
                                                                      - FirePro W9100
                                                                      - FirePro S9150
                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
                                                                      - Radeon R9 290x
                                                                      - Radeon R390
                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
                 - ``mullins``                                        - E1-2200
                                                                      - E1-2500
                                                                      - E2-3000
                                                                      - E2-3800
                                                                      - A4-5000
                                                                      - A4-5100
                                                                      - A6-5200
                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
                                                                      - Radeon HD 8770
                                                                      - R7 260
                                                                      - R7 260X
     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
                                                      [on]            - Pro A6-8500B
                                                                      - A8-8600P
                                                                      - Pro A8-8600B
                                                                      - FX-8800P
                                                                      - Pro A12-8800B
     \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
                                                      [on]            - Pro A10-8700B
                                                                      - A10-8780P
     \                           ``amdgcn``   APU   - xnack           - A10-9600P
                                                      [on]            - A10-9630P
                                                                      - A12-9700P
                                                                      - A12-9730P
                                                                      - FX-9800P
                                                                      - FX-9830P
     \                           ``amdgcn``   APU   - xnack           - E2-9010
                                                      [on]            - A6-9210
                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
                 - ``tonga``                          [off]           - FirePro S7100
                                                                      - FirePro W7100
                                                                      - Radeon R285
                                                                      - Radeon R9 380
                                                                      - Radeon R9 385
                                                                      - Mobile FirePro
                                                                        M7170
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
                                                      [off]           - Radeon R9 Fury
                                                                      - Radeon R9 FuryX
                                                                      - Radeon Pro Duo
                                                                      - FirePro S9300x2
                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
                                                      [off]           - Radeon RX 480
                                                                      - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
                                                      [off]
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
                                                      [on]
     **GCN GFX9** [AMD-GCN-GFX9]_
     -----------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
                                                      [off]             Frontier Edition
                                                                      - Radeon RX Vega 56
                                                                      - Radeon RX Vega 64
                                                                      - Radeon RX Vega 64
                                                                        Liquid
                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Ryzen 3 2200G
                                                      [on]            - Ryzen 5 2400G
     ``gfx904``                  ``amdgcn``   dGPU  - xnack           *TBA*
                                                      [off]
                                                                      .. TODO
                                                                         Add product
                                                                         names.
     ``gfx906``                  ``amdgcn``   dGPU  - xnack           *TBA*
                                                      [off]
                                                                      .. TODO
                                                                         Add product
                                                                         names.
     =========== =============== ============ ===== ========= ======= ==================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-feature-table

     ============== ==================================================
     Target Feature Description
     ============== ==================================================
     -m[no-]xnack   Enable/disable generating code that has
                    memory clauses that are compatible with
                    having XNACK replay enabled.

                    This is used for demand paging and page
                    migration. If XNACK replay is enabled in
                    the device, then if a page fault occurs
                    the code may execute incorrectly if the
                    ``xnack`` feature is not enabled. Executing
                    code that has the feature enabled on a
                    device that does not have XNACK replay
                    enabled will execute correctly, but may
                    be less performant than code with the
                    feature disabled.
     ============== ==================================================

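The ``-m[no-]<TargetFeature>`` spelling and the per-processor defaults can be
sketched as follows. The helper names ``feature_flag`` and ``xnack_flag`` are
hypothetical; the default values are the ``[on]`` and ``[off]`` entries copied
from :ref:`amdgpu-processor-table`.

```python
# Default xnack settings copied from the processor table ([on] / [off]).
XNACK_DEFAULT = {
    "gfx801": True,
    "gfx802": False,
    "gfx803": False,
    "gfx810": True,
    "gfx900": False,
    "gfx902": True,
    "gfx904": False,
    "gfx906": False,
}


def feature_flag(feature, enabled):
    """Spell a target feature as a clang -m[no-]<TargetFeature> option."""
    return f"-m{feature}" if enabled else f"-mno-{feature}"


def xnack_flag(processor, enabled=None):
    """Hypothetical helper: return the xnack option, using the table default
    when no explicit setting is given."""
    if processor not in XNACK_DEFAULT:
        raise ValueError(f"{processor!r} does not list the xnack feature")
    if enabled is None:
        enabled = XNACK_DEFAULT[processor]
    return feature_flag("xnack", enabled)


print(xnack_flag("gfx902"))        # table default for gfx902
print(xnack_flag("gfx900", True))  # explicit override
```
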
.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU backend uses the following address space mappings.

The memory space names used in the table, aside from the region memory space,
are from the OpenCL standard.

The LLVM address space number is used throughout LLVM (for example, in LLVM
IR).

  .. table:: Address Space Mapping
     :name: amdgpu-address-space-mapping-table

     ================== =================
     LLVM Address Space Memory Space
     ================== =================
     0                  Generic (Flat)
     1                  Global
     2                  Region (GDS)
     3                  Local (group/LDS)
     4                  Constant
     5                  Private (Scratch)
     6                  Constant 32-bit
     ================== =================

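For tooling that needs to print a readable name for an address space number,
the mapping above can be held in a small lookup table. This is only a sketch;
the names ``ADDRESS_SPACES`` and ``memory_space_name`` are hypothetical.

```python
# Mapping copied from the Address Space Mapping table.
ADDRESS_SPACES = {
    0: "Generic (Flat)",
    1: "Global",
    2: "Region (GDS)",
    3: "Local (group/LDS)",
    4: "Constant",
    5: "Private (Scratch)",
    6: "Constant 32-bit",
}


def memory_space_name(address_space):
    """Return the memory space name for an LLVM address space number."""
    if address_space not in ADDRESS_SPACES:
        raise ValueError(f"address space {address_space} is not in the table")
    return ADDRESS_SPACES[address_space]


print(memory_space_name(3))
```
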
.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of
scope, and synchronizes-with allows the memory scope instances to be inclusive
(see table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ================ ==========================================================
     LLVM Sync Scope  Description
     ================ ==========================================================
     *none*           The default: ``system``.

                      Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system``.
                      - ``agent`` and executed by a thread on the same agent.
                      - ``workgroup`` and executed by a thread in the same
                        workgroup.
                      - ``wavefront`` and executed by a thread in the same
                        wavefront.

     ``agent``        Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system`` or ``agent`` and executed by a thread on the
                        same agent.
                      - ``workgroup`` and executed by a thread in the same
                        workgroup.
                      - ``wavefront`` and executed by a thread in the same
                        wavefront.

     ``workgroup``    Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system``, ``agent`` or ``workgroup`` and executed by a
                        thread in the same workgroup.
                      - ``wavefront`` and executed by a thread in the same
                        wavefront.

     ``wavefront``    Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system``, ``agent``, ``workgroup`` or ``wavefront``
                        and executed by a thread in the same wavefront.

     ``singlethread`` Only synchronizes with, and participates in modification
                      and seq_cst total orderings with, other operations (except
                      image operations) running in the same thread for all
                      address spaces (for example, in signal handlers).
     ================ ==========================================================

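The scope inclusion rule in the table above can be read as: two operations may
synchronize when both threads belong to the same instance of the narrower of
the two sync scopes. The sketch below models only that rule. The ``Thread``
tuple, its field names, and the function names are assumptions made for
illustration, and the exceptions for image operations and for the private
address space are not modeled.

```python
from collections import namedtuple

# Hypothetical description of where a thread executes.
Thread = namedtuple("Thread", "agent workgroup wavefront lane")

# Sync scopes ordered from narrowest to widest.
SCOPE_RANK = {
    "singlethread": 0,
    "wavefront": 1,
    "workgroup": 2,
    "agent": 3,
    "system": 4,
}


def same_instance(scope, a, b):
    """Return True if threads a and b are in the same instance of scope."""
    if scope == "system":
        return True
    if scope == "agent":
        return a.agent == b.agent
    if scope == "workgroup":
        return a[:2] == b[:2]  # same agent and workgroup
    if scope == "wavefront":
        return a[:3] == b[:3]  # same agent, workgroup, and wavefront
    return a == b              # singlethread: the very same thread


def may_synchronize(scope_a, thread_a, scope_b, thread_b):
    """Apply scope inclusion; an omitted (None) scope defaults to system."""
    scope_a = scope_a or "system"
    scope_b = scope_b or "system"
    narrower = min(scope_a, scope_b, key=SCOPE_RANK.__getitem__)
    return same_instance(narrower, thread_a, thread_b)


t0 = Thread(agent=0, workgroup=0, wavefront=0, lane=0)
t1 = Thread(agent=0, workgroup=0, wavefront=1, lane=0)
print(may_synchronize("workgroup", t0, "agent", t1))  # same workgroup
print(may_synchronize("wavefront", t0, None, t1))     # different wavefronts
```

Under the OpenCL model described earlier, the two scopes would instead have to
match exactly, so the first call above would not qualify there.
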
AMDGPU Intrinsics
-----------------

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO
   List AMDGPU intrinsics

AMDGPU Attributes
-----------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` Clang attribute [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-max-work-group-size"="n"        Specify the maximum work-group size that will be specified
                                             when the kernel is dispatched.
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` Clang attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` Clang attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             Clang attribute [CLANG-ATTR]_.
     ======================================= ==========================================================

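The attributes above are string key/value pairs whose values are either a
single integer or a comma-separated pair. A tool that emits or checks them
might handle the pair form along these lines. This is a minimal sketch: the
helper names are hypothetical, and only the two-value ``"min,max"`` form shown
in the table is handled.

```python
def parse_pair(value):
    """Parse a "min,max" attribute value into a tuple of two integers."""
    parts = value.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated integers: {value!r}")
    low, high = int(parts[0]), int(parts[1])
    if low > high:
        raise ValueError(f"minimum {low} exceeds maximum {high}")
    return low, high


def format_attribute(name, *values):
    """Render an attribute in the quoted "name"="value" form used above."""
    joined = ",".join(str(v) for v in values)
    return f'"{name}"="{joined}"'


print(format_attribute("amdgpu-flat-work-group-size", 64, 256))
print(parse_pair("64,256"))
```
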
Code Object
===========

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      - ``ELFCLASS32``
                                - ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
     ========================== ===============================

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA``    1
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

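To show how these header values appear in a file, the sketch below decodes the
leading ELF header fields from raw bytes. The function name
``read_amdgpu_header`` is hypothetical. The byte offsets and the
``ELFCLASS``, ``ELFDATA2LSB`` and ``ET_`` values are those of the generic ELF
specification [ELF]_; only the ``EM_AMDGPU`` and ``ELFOSABI_`` values come from
the table above.

```python
import struct

# Values copied from the enumeration table above.
EM_AMDGPU = 224
OSABI_NAMES = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}

# Generic ELF constants (not AMDGPU specific).
ELF_MAGIC = b"\x7fELF"
ELFDATA2LSB = 1
CLASS_NAMES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
TYPE_NAMES = {1: "ET_REL", 3: "ET_DYN"}


def read_amdgpu_header(data):
    """Hypothetical helper: decode the first 20 bytes of an ELF header."""
    if len(data) < 20 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    # e_ident bytes 4..8: EI_CLASS, EI_DATA, EI_VERSION, EI_OSABI, EI_ABIVERSION
    ei_class, ei_data, _ei_version, ei_osabi, ei_abiversion = data[4:9]
    if ei_data != ELFDATA2LSB:
        raise ValueError("AMDGPU code objects are little-endian")
    # e_type and e_machine are 16-bit fields that follow the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from("<HH", data, 16)
    if e_machine != EM_AMDGPU:
        raise ValueError(f"e_machine {e_machine} is not EM_AMDGPU")
    return {
        "class": CLASS_NAMES.get(ei_class, ei_class),
        "osabi": OSABI_NAMES.get(ei_osabi, ei_osabi),
        "abiversion": ei_abiversion,
        "type": TYPE_NAMES.get(e_type, e_type),
    }
```

A caller would typically pass the first bytes read from a code object file
opened in binary mode.
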
``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMD GPU architecture specific OS ABIs
  (see :ref:`amdgpu-os-table`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMD GPU architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMD GPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``xnack`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ================================= ========== =============================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ================================= ========== =============================
     Name                              Value      Description (see
                                                  :ref:`amdgpu-processor-table`)
     ================================= ========== =============================
     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
     *reserved*                        0x011 -    Reserved for ``r600``
                                       0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
     *reserved*                        0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
     *reserved*                        0x030      Reserved.
     ================================= ========== =============================

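Decoding ``e_flags`` amounts to masking out the two bit fields described
above. The following sketch uses the mask values from the ``e_flags`` table
and, for brevity, only the ``amdgcn`` entries of the ``EF_AMDGPU_MACH`` table;
the function name ``decode_e_flags`` is hypothetical.

```python
# Masks copied from the e_flags table.
EF_AMDGPU_MACH = 0x000000FF
EF_AMDGPU_XNACK = 0x00000100

# amdgcn entries copied from the EF_AMDGPU_MACH values table.
MACH_NAMES = {
    0x020: "gfx600", 0x021: "gfx601",
    0x022: "gfx700", 0x023: "gfx701", 0x024: "gfx702",
    0x025: "gfx703", 0x026: "gfx704",
    0x028: "gfx801", 0x029: "gfx802", 0x02A: "gfx803", 0x02B: "gfx810",
    0x02C: "gfx900", 0x02D: "gfx902", 0x02E: "gfx904", 0x02F: "gfx906",
}


def decode_e_flags(e_flags):
    """Return (processor name or None, xnack enabled) for an e_flags value."""
    mach = e_flags & EF_AMDGPU_MACH
    xnack = bool(e_flags & EF_AMDGPU_XNACK)
    return MACH_NAMES.get(mach), xnack


print(decode_e_flags(0x12D))  # gfx902 with the xnack bit set
print(decode_e_flags(0x02C))  # gfx900 with the xnack bit clear
```
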
594 Sections
595 --------
597 An AMDGPU target ELF code object has the standard ELF sections which include:
599   .. table:: AMDGPU ELF Sections
600      :name: amdgpu-elf-sections-table
602      ================== ================ =================================
603      Name               Type             Attributes
604      ================== ================ =================================
605      ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
606      ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
607      ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
608      ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
609      ``.dynstr``        ``SHT_STRTAB``   ``SHF_ALLOC``
610      ``.dynsym``        ``SHT_DYNSYM``   ``SHF_ALLOC``
611      ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
612      ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
613      ``.note``          ``SHT_NOTE``     *none*
614      ``.rela``\ *name*  ``SHT_RELA``     *none*
615      ``.rela.dyn``      ``SHT_RELA``     *none*
616      ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
617      ``.shstrtab``      ``SHT_STRTAB``   *none*
618      ``.strtab``        ``SHT_STRTAB``   *none*
619      ``.symtab``        ``SHT_SYMTAB``   *none*
620      ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
621      ================== ================ =================================
623 These sections have their standard meanings (see [ELF]_) and are only generated
624 if needed.
626 ``.debug_``\ *\**
627   The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
628   DWARF produced by the AMDGPU backend.
630 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
631   The standard sections used by a dynamic loader.
633 ``.note``
634   See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
635   backend.
637 ``.rela``\ *name*, ``.rela.dyn``
638   For relocatable code objects, *name* is the name of the section that the
639   relocation records apply to. For example, ``.rela.text`` is the section name for
640   relocation records associated with the ``.text`` section.
642   For linked shared code objects, ``.rela.dyn`` contains all the relocation
643   records from each of the relocatable code object's ``.rela``\ *name* sections.
645   See :ref:`amdgpu-relocation-records` for the relocation records supported by
646   the AMDGPU backend.
648 ``.text``
649   The executable machine code for the kernels and functions they call. Generated
650   as position independent code. See :ref:`amdgpu-code-conventions` for
651   information on conventions used in the ISA generation.
653 .. _amdgpu-note-records:
655 Note Records
656 ------------
658 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
659 be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
660 aligned. In addition, minimal zero byte padding must be generated to ensure the
661 ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
662 ``.note`` section must be at least 4 to indicate at least 4 byte alignment.
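
The padding rules above can be sketched in Python as follows. This is only an
illustration, not the backend's implementation: the helper names ``pad4`` and
``build_note`` are hypothetical, the standard ELF note header layout
(``namesz``, ``descsz``, ``type``) is used, and little-endian encoding is
assumed.

.. code-block:: python

   import struct

   def pad4(data):
       """Append the minimal number of zero bytes to reach a multiple of 4."""
       return data + b"\0" * (-len(data) % 4)

   def build_note(name, note_type, desc):
       """Build one ELF note record: header, padded name, padded desc."""
       # namesz counts the terminating null character of the name.
       name_bytes = name.encode("ascii") + b"\0"
       # Assumption: little-endian 32-bit namesz, descsz and type fields.
       header = struct.pack("<III", len(name_bytes), len(desc), note_type)
       # Padding after name keeps desc 4 byte aligned; padding after desc
       # keeps the whole record a multiple of 4 bytes.
       return header + pad4(name_bytes) + pad4(desc)

   # Type 10 is NT_AMD_AMDGPU_HSA_METADATA; the desc bytes are made up.
   note = build_note("AMD", 10, b"hello")

Here the 5 byte ``desc`` is followed by 3 zero bytes, while ``descsz`` in the
header still records the unpadded size.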
664 The AMDGPU backend code object uses the following ELF note records in the
665 ``.note`` section. The *Description* column specifies the layout of the note
666 record's ``desc`` field. All fields are consecutive bytes. Note records with
667 variable size strings have a corresponding ``*_size`` field that specifies the
668 number of bytes, including the terminating null character, in the string. The
669 string(s) come immediately after the preceding fields.
671 Additional note records can be present.
673   .. table:: AMDGPU ELF Note Records
674      :name: amdgpu-elf-note-records-table
676      ===== ============================== ======================================
677      Name  Type                           Description
678      ===== ============================== ======================================
679      "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
680      ===== ============================== ======================================
684   .. table:: AMDGPU ELF Note Record Enumeration Values
685      :name: amdgpu-elf-note-record-enumeration-values-table
687      ============================== =====
688      Name                           Value
689      ============================== =====
690      *reserved*                       0-9
691      ``NT_AMD_AMDGPU_HSA_METADATA``    10
692      *reserved*                        11
693      ============================== =====
695 ``NT_AMD_AMDGPU_HSA_METADATA``
696   Specifies extensible metadata associated with the code objects executed on HSA
697   [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
698   the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
699   :ref:`amdgpu-amdhsa-code-object-metadata` for the syntax of the code
700   object metadata string.
702 .. _amdgpu-symbols:
704 Symbols
705 -------
707 Symbols include the following:
709   .. table:: AMDGPU ELF Symbols
710      :name: amdgpu-elf-symbols-table
712      ===================== ============== ============= ==================
713      Name                  Type           Section       Description
714      ===================== ============== ============= ==================
715      *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
716                                           - ``.rodata``
717                                           - ``.bss``
718      *link-name*\ ``.kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
719      *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
720      ===================== ============== ============= ==================
722 Global variable
723   Global variables both used and defined by the compilation unit.
725   If the symbol is defined in the compilation unit then it is allocated in the
726   appropriate section according to whether it has initialized data or is readonly.
728   If the symbol is external then its section is ``STN_UNDEF`` and the loader
729   will resolve relocations using the definition provided by another code object
730   or explicitly defined by the runtime.
732   All global symbols, whether defined in the compilation unit or external, are
733   accessed by the machine code indirectly through a GOT table entry. This
734   allows them to be preemptable. The GOT table is only supported when the target
735   triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
737   .. TODO
738      Add description of linked shared object symbols. Seems undefined symbols
739      are marked as STT_NOTYPE.
741 Kernel descriptor
742   Every HSA kernel has an associated kernel descriptor. It is the address of the
743   kernel descriptor that is used in the AQL dispatch packet used to invoke the
744   kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
745   defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
747 Kernel entry point
748   Every HSA kernel also has a symbol for its machine code entry point.
750 .. _amdgpu-relocation-records:
752 Relocation Records
753 ------------------
755 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
756 relocatable fields are:
758 ``word32``
759   This specifies a 32-bit field occupying 4 bytes with arbitrary byte
760   alignment. These values use the same byte order as other word values in the
761   AMD GPU architecture.
763 ``word64``
764   This specifies a 64-bit field occupying 8 bytes with arbitrary byte
765   alignment. These values use the same byte order as other word values in the
766   AMD GPU architecture.
768 The following notations are used for specifying relocation calculations:
770 **A**
771   Represents the addend used to compute the value of the relocatable field.
773 **G**
774   Represents the offset into the global offset table at which the relocation
775   entry's symbol will reside during execution.
777 **GOT**
778   Represents the address of the global offset table.
780 **P**
781   Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
782   of the storage unit being relocated (computed using ``r_offset``).
784 **S**
785   Represents the value of the symbol whose index resides in the relocation
786   entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
788 **B**
789   Represents the base address of a loaded executable or shared object which is
790   the difference between the ELF address and the actual load address. Relocations
791   using this are only valid in executable or shared objects.
793 The following relocation types are supported:
795   .. table:: AMDGPU ELF Relocation Records
796      :name: amdgpu-elf-relocation-records-table
798      ========================== ======= =====  ==========  ==============================
799      Relocation Type            Kind    Value  Field       Calculation
800      ========================== ======= =====  ==========  ==============================
801      ``R_AMDGPU_NONE``                  0      *none*      *none*
802      ``R_AMDGPU_ABS32_LO``      Static, 1      ``word32``  (S + A) & 0xFFFFFFFF
803                                 Dynamic
804      ``R_AMDGPU_ABS32_HI``      Static, 2      ``word32``  (S + A) >> 32
805                                 Dynamic
806      ``R_AMDGPU_ABS64``         Static, 3      ``word64``  S + A
807                                 Dynamic
808      ``R_AMDGPU_REL32``         Static  4      ``word32``  S + A - P
809      ``R_AMDGPU_REL64``         Static  5      ``word64``  S + A - P
810      ``R_AMDGPU_ABS32``         Static, 6      ``word32``  S + A
811                                 Dynamic
812      ``R_AMDGPU_GOTPCREL``      Static  7      ``word32``  G + GOT + A - P
813      ``R_AMDGPU_GOTPCREL32_LO`` Static  8      ``word32``  (G + GOT + A - P) & 0xFFFFFFFF
814      ``R_AMDGPU_GOTPCREL32_HI`` Static  9      ``word32``  (G + GOT + A - P) >> 32
815      ``R_AMDGPU_REL32_LO``      Static  10     ``word32``  (S + A - P) & 0xFFFFFFFF
816      ``R_AMDGPU_REL32_HI``      Static  11     ``word32``  (S + A - P) >> 32
817      *reserved*                         12
818      ``R_AMDGPU_RELATIVE64``    Dynamic 13     ``word64``  B + A
819      ========================== ======= =====  ==========  ==============================
821 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
822 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
824 There is no current OS loader support for 32 bit programs and so
825 ``R_AMDGPU_ABS32`` is not used.
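
As a rough illustration of the *Calculation* column, the following Python
sketch evaluates several of the relocation types using the **S**, **A**, **P**,
**G**, **GOT** and **B** notation defined above. The function and table names
are hypothetical. The sketch assumes 64-bit wraparound arithmetic and assumes
that ``word32`` results are truncated to 32 bits; the table itself does not
state the overflow behavior.

.. code-block:: python

   MASK32 = 0xFFFFFFFF
   MASK64 = 0xFFFFFFFFFFFFFFFF

   def lo32(value):
       """Low 32 bits of a value."""
       return value & MASK32

   def hi32(value):
       """High 32 bits of a value after wrapping it to 64 bits (assumption)."""
       return (value & MASK64) >> 32

   # Each entry follows the Calculation column of the relocation table.
   CALCULATIONS = {
       "R_AMDGPU_ABS32_LO": lambda S, A, P, G, GOT, B: lo32(S + A),
       "R_AMDGPU_ABS32_HI": lambda S, A, P, G, GOT, B: hi32(S + A),
       "R_AMDGPU_ABS64": lambda S, A, P, G, GOT, B: (S + A) & MASK64,
       "R_AMDGPU_REL32": lambda S, A, P, G, GOT, B: lo32(S + A - P),
       "R_AMDGPU_REL64": lambda S, A, P, G, GOT, B: (S + A - P) & MASK64,
       "R_AMDGPU_GOTPCREL": lambda S, A, P, G, GOT, B: lo32(G + GOT + A - P),
       "R_AMDGPU_REL32_LO": lambda S, A, P, G, GOT, B: lo32(S + A - P),
       "R_AMDGPU_REL32_HI": lambda S, A, P, G, GOT, B: hi32(S + A - P),
       "R_AMDGPU_RELATIVE64": lambda S, A, P, G, GOT, B: (B + A) & MASK64,
   }

   def relocate(reloc_type, S=0, A=0, P=0, G=0, GOT=0, B=0):
       """Return the value to store in the relocated field."""
       return CALCULATIONS[reloc_type](S, A, P, G, GOT, B)

For instance, ``relocate("R_AMDGPU_ABS32_LO", S=0x100002000, A=8)`` and
``relocate("R_AMDGPU_ABS32_HI", S=0x100002000, A=8)`` together yield the two
halves of the 64-bit address ``0x100002008``.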
827 .. _amdgpu-dwarf:
829 DWARF
830 -----
832 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
833 information that maps the code object executable code and data to the source
834 language constructs. It can be used by tools such as debuggers and profilers.
836 Address Space Mapping
837 ~~~~~~~~~~~~~~~~~~~~~
839 The following address space mapping is used:
841   .. table:: AMDGPU DWARF Address Space Mapping
842      :name: amdgpu-dwarf-address-space-mapping-table
844      =================== =================
845      DWARF Address Space Memory Space
846      =================== =================
847      1                   Private (Scratch)
848      2                   Local (group/LDS)
849      *omitted*           Global
850      *omitted*           Constant
851      *omitted*           Generic (Flat)
852      *not supported*     Region (GDS)
853      =================== =================
855 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
856 used in the table.
858 An ``address_class`` attribute is generated on pointer type DIEs to specify the
859 DWARF address space of the value of the pointer when it is in the *private* or
860 *local* address space. Otherwise the attribute is omitted.
862 An ``XDEREF`` operation is generated in location list expressions for variables
863 that are allocated in the *private* or *local* address space. Otherwise no
864 ``XDEREF`` operation is generated.
866 Register Mapping
867 ~~~~~~~~~~~~~~~~
869 *This section is WIP.*
871 .. TODO
872    Define DWARF register enumeration.
874    If want to present a wavefront state then should expose vector registers as
875    64 wide (rather than per work-item view that LLVM uses). Either as separate
876    registers, or a 64x4 byte single register. In either case use a new LANE op
877    (akin to XDEREF) to select the current lane usage in a location
878    expression. This would also allow scalar register spilling to vector register
879    lanes to be expressed (currently no debug information is being generated for
880    spilling). If choose a wide single register approach then use LANE in
881    conjunction with PIECE operation to select the dword part of the register for
882    the current lane. If the separate register approach then use LANE to select
883    the register.
885 Source Text
886 ~~~~~~~~~~~
888 Source text for online-compiled programs (e.g. those compiled by the OpenCL
889 runtime) may be embedded into the DWARF v5 line table using the ``clang
890 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
892 For example:
894 ``-gembed-source``
895   Enable the embedded source DWARF v5 extension.
896 ``-gno-embed-source``
897   Disable the embedded source DWARF v5 extension.
899   .. table:: AMDGPU Debug Options
900      :name: amdgpu-debug-options
902      ==================== ==================================================
903      Debug Flag           Description
904      ==================== ==================================================
905      -g[no-]embed-source  Enable/disable embedding source text in DWARF
906                           debug sections. Useful for environments where
907                           source cannot be written to disk, such as
908                           when performing online compilation.
909      ==================== ==================================================
911 This option enables one extended content type in the DWARF v5 Line Number
912 Program Header, which is used to encode embedded source.
914   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
915      :name: amdgpu-dwarf-extended-content-types
917      ============================  ======================
918      Content Type                  Form
919      ============================  ======================
920      ``DW_LNCT_LLVM_source``       ``DW_FORM_line_strp``
921      ============================  ======================
923 The source field will contain the UTF-8 encoded, null-terminated source text
924 with ``'\n'`` line endings. When the source field is present, consumers can use
925 the embedded source instead of attempting to discover the source on disk. When
926 the source field is absent, consumers can access the file to get the source
927 text.
929 The above content type appears in the ``file_name_entry_format`` field of the
930 line table prologue, and its corresponding value appears in the ``file_names``
931 field. The current encoding of the content type is documented in table
932 :ref:`amdgpu-dwarf-extended-content-types-encoding`.
934   .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
935      :name: amdgpu-dwarf-extended-content-types-encoding
937      ============================  ====================
938      Content Type                  Value
939      ============================  ====================
940      ``DW_LNCT_LLVM_source``       0x2001
941      ============================  ====================
943 .. _amdgpu-code-conventions:
945 Code Conventions
946 ================
948 This section provides code conventions used for each supported target triple OS
949 (see :ref:`amdgpu-target-triples`).
951 AMDHSA
952 ------
954 This section provides code conventions used when the target triple OS is
955 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
957 .. _amdgpu-amdhsa-code-object-target-identification:
959 Code Object Target Identification
960 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
962 The AMDHSA OS uses the following syntax to specify the code object
963 target as a single string:
965   ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
967 Where:
969   - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
970     are the same as the *Target Triple* (see
971     :ref:`amdgpu-target-triples`).
973   - ``<Processor>`` is the same as the *Processor* (see
974     :ref:`amdgpu-processors`).
976   - ``<Target Features>`` is a list of the enabled *Target Features*
977     (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
978     apply to *Processor*. The list must be in the same order as listed
979     in the table :ref:`amdgpu-target-feature-table`. Note that *Target
980     Features* must be included in the list if they are enabled even if
981     that is the default for *Processor*.
983 For example:
985   ``"amdgcn-amd-amdhsa--gfx902+xnack"``
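
The syntax above can be sketched in Python as follows. The function names are
hypothetical, and the sketch does not check that the target features appear in
the order required by the target feature table; it only shows how the string is
assembled and taken apart.

.. code-block:: python

   def format_target_id(arch, vendor, os_name, environment, processor,
                        features=()):
       """Join the triple, processor and '+'-prefixed features."""
       triple = "-".join([arch, vendor, os_name, environment])
       return triple + "-" + processor + "".join("+" + f for f in features)

   def parse_target_id(target_id):
       """Split a code object target string into its components."""
       # The first four '-' separate the triple; the rest is the processor
       # followed by zero or more '+feature' suffixes.
       arch, vendor, os_name, environment, rest = target_id.split("-", 4)
       processor, *features = rest.split("+")
       return {
           "architecture": arch,
           "vendor": vendor,
           "os": os_name,
           "environment": environment,
           "processor": processor,
           "features": features,
       }

   target = format_target_id("amdgcn", "amd", "amdhsa", "", "gfx902",
                             ["xnack"])

The empty environment component is what produces the double hyphen in the
example string above.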
987 .. _amdgpu-amdhsa-code-object-metadata:
989 Code Object Metadata
990 ~~~~~~~~~~~~~~~~~~~~
992 The code object metadata specifies extensible metadata associated with the code
993 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
994 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
995 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
996 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
997 information necessary to support the ROCm kernel queries, such as the
998 segment sizes needed in a dispatch packet. In addition, a high level language
999 runtime may require other information to be included. For example, the AMD
1000 OpenCL runtime records kernel argument information.
1002 The metadata is specified as a YAML formatted string (see [YAML]_ and
1003 :doc:`YamlIO`).
1005 .. TODO
1006    Is the string null terminated? It probably should not if YAML allows it to
1007    contain null characters, otherwise it should be.
1009 The metadata is represented as a single YAML document comprised of the mapping
1010 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
1011 referenced tables.
1013 For boolean values, the string values of ``false`` and ``true`` are used for
1014 false and true respectively.
1016 Additional information can be added to the mappings. To avoid conflicts, any
1017 non-AMD key names should be prefixed by "*vendor-name*.".
1019   .. table:: AMDHSA Code Object Metadata Mapping
1020      :name: amdgpu-amdhsa-code-object-metadata-mapping-table
1022      ========== ============== ========= =======================================
1023      String Key Value Type     Required? Description
1024      ========== ============== ========= =======================================
1025      "Version"  sequence of    Required  - The first integer is the major
1026                 2 integers                 version. Currently 1.
1027                                          - The second integer is the minor
1028                                            version. Currently 0.
1029      "Printf"   sequence of              Each string is encoded information
1030                 strings                  about a printf function call. The
1031                                          encoded information is organized as
1032                                          fields separated by colon (':'):
1034                                          ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1036                                          where:
1038                                          ``ID``
1039                                            A 32 bit integer as a unique id for
1040                                            each printf function call
1042                                          ``N``
1043                                            A 32 bit integer equal to the number
1044                                            of arguments of printf function call
1045                                            minus 1
1047                                          ``S[i]`` (where i = 0, 1, ... , N-1)
1048                                            32 bit integers for the size in bytes
1049                                            of the i-th FormatString argument of
1050                                            the printf function call
1052                                          FormatString
1053                                            The format string passed to the
1054                                            printf function call.
1055      "Kernels"  sequence of    Required  Sequence of the mappings for each
1056                 mapping                  kernel in the code object. See
1057                                          :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1058                                          for the definition of the mapping.
1059      ========== ============== ========= =======================================
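
The "Printf" string encoding described in the table can be decoded along the
following lines. This is a sketch only: the function name is hypothetical, the
integer fields are assumed to be written in decimal, and the sample entry is
made up for illustration. Because the format string may itself contain colons,
only the first ``N + 2`` colons are treated as field separators.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split 'ID:N:S[0]:...:S[N-1]:FormatString' into its fields."""
       printf_id, count, rest = entry.split(":", 2)
       n = int(count)
       # Split off exactly N size fields; whatever remains is the format
       # string, which is allowed to contain ':' characters.
       fields = rest.split(":", n)
       sizes = [int(size) for size in fields[:n]]
       format_string = fields[n]
       return int(printf_id), sizes, format_string

   # Hypothetical entry: id 1, two 4 byte arguments.
   entry = "1:2:4:4:a: %d, b: %d\n"
   printf_id, sizes, format_string = parse_printf_entry(entry)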
1063   .. table:: AMDHSA Code Object Kernel Metadata Mapping
1064      :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1066      ================= ============== ========= ================================
1067      String Key        Value Type     Required? Description
1068      ================= ============== ========= ================================
1069      "Name"            string         Required  Source name of the kernel.
1070      "SymbolName"      string         Required  Name of the kernel
1071                                                 descriptor ELF symbol.
1072      "Language"        string                   Source language of the kernel.
1073                                                 Values include:
1075                                                 - "OpenCL C"
1076                                                 - "OpenCL C++"
1077                                                 - "HCC"
1078                                                 - "OpenMP"
1080      "LanguageVersion" sequence of              - The first integer is the major
1081                        2 integers                 version.
1082                                                 - The second integer is the
1083                                                   minor version.
1084      "Attrs"           mapping                  Mapping of kernel attributes.
1085                                                 See
1086                                                 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1087                                                 for the mapping definition.
1088      "Args"            sequence of              Sequence of mappings of the
1089                        mapping                  kernel arguments. See
1090                                                 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1091                                                 for the definition of the mapping.
1092      "CodeProps"       mapping                  Mapping of properties related to
1093                                                 the kernel code. See
1094                                                 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1095                                                 for the mapping definition.
1096      ================= ============== ========= ================================
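
As an illustration of the *Required?* columns in the two tables above, the
following Python sketch checks an already parsed metadata document (for
example, the result of loading the YAML string with a YAML library) for the
required keys. The function name, the kernel name ``foo`` and the symbol name
``foo.kd`` are hypothetical.

.. code-block:: python

   REQUIRED_TOP_LEVEL = ("Version", "Kernels")
   REQUIRED_KERNEL = ("Name", "SymbolName")

   def missing_required_keys(metadata):
       """Return a list describing every required key that is absent."""
       problems = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
       for index, kernel in enumerate(metadata.get("Kernels", [])):
           for key in REQUIRED_KERNEL:
               if key not in kernel:
                   problems.append("Kernels[%d].%s" % (index, key))
       return problems

   # A minimal document using only keys from the tables above.
   example = {
       "Version": [1, 0],
       "Kernels": [
           {"Name": "foo", "SymbolName": "foo.kd", "Language": "OpenCL C"},
       ],
   }

Optional keys such as "Language" are ignored by this check, so additional
vendor-prefixed keys would not be reported either.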
1100   .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1101      :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1103      =================== ============== ========= ==============================
1104      String Key          Value Type     Required? Description
1105      =================== ============== ========= ==============================
1106      "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
1107                          3 integers               must be >=1 and the dispatch
1108                                                   work-group size X, Y, Z must
1109                                                   correspond to the specified
1110                                                   values. Defaults to 0, 0, 0.
1112                                                   Corresponds to the OpenCL
1113                                                   ``reqd_work_group_size``
1114                                                   attribute.
1115      "WorkGroupSizeHint" sequence of              The dispatch work-group size
1116                          3 integers               X, Y, Z is likely to be the
1117                                                   specified values.
1119                                                   Corresponds to the OpenCL
1120                                                   ``work_group_size_hint``
1121                                                   attribute.
1122      "VecTypeHint"       string                   The name of a scalar or vector
1123                                                   type.
1125                                                   Corresponds to the OpenCL
1126                                                   ``vec_type_hint`` attribute.
1128      "RuntimeHandle"     string                   The external symbol name
1129                                                   associated with a kernel.
1130                                                   OpenCL runtime allocates a
1131                                                   global buffer for the symbol
1132                                                   and saves the kernel's address
1133                                                   to it, which is used for
1134                                                   device side enqueueing. Only
1135                                                   available for device side
1136                                                   enqueued kernels.
1137      =================== ============== ========= ==============================
1141   .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1142      :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1144      ================= ============== ========= ================================
1145      String Key        Value Type     Required? Description
1146      ================= ============== ========= ================================
1147      "Name"            string                   Kernel argument name.
1148      "TypeName"        string                   Kernel argument type name.
1149      "Size"            integer        Required  Kernel argument size in bytes.
1150      "Align"           integer        Required  Kernel argument alignment in
1151                                                 bytes. Must be a power of two.
1152      "ValueKind"       string         Required  Kernel argument kind that
1153                                                 specifies how to set up the
1154                                                 corresponding argument.
1155                                                 Values include:
1157                                                 "ByValue"
1158                                                   The argument is copied
1159                                                   directly into the kernarg.
1161                                                 "GlobalBuffer"
1162                                                   A global address space pointer
1163                                                   to the buffer data is passed
1164                                                   in the kernarg.
1166                                                 "DynamicSharedPointer"
1167                                                   A group address space pointer
1168                                                   to dynamically allocated LDS
1169                                                   is passed in the kernarg.
1171                                                 "Sampler"
1172                                                   A global address space
1173                                                   pointer to a S# is passed in
1174                                                   the kernarg.
1176                                                 "Image"
1177                                                   A global address space
1178                                                   pointer to a T# is passed in
1179                                                   the kernarg.
1181                                                 "Pipe"
1182                                                   A global address space pointer
1183                                                   to an OpenCL pipe is passed in
1184                                                   the kernarg.
1186                                                 "Queue"
1187                                                   A global address space pointer
1188                                                   to an OpenCL device enqueue
1189                                                   queue is passed in the
1190                                                   kernarg.
1192                                                 "HiddenGlobalOffsetX"
1193                                                   The OpenCL grid dispatch
1194                                                   global offset for the X
1195                                                   dimension is passed in the
1196                                                   kernarg.
1198                                                 "HiddenGlobalOffsetY"
1199                                                   The OpenCL grid dispatch
1200                                                   global offset for the Y
1201                                                   dimension is passed in the
1202                                                   kernarg.
1204                                                 "HiddenGlobalOffsetZ"
1205                                                   The OpenCL grid dispatch
1206                                                   global offset for the Z
1207                                                   dimension is passed in the
1208                                                   kernarg.
1210                                                 "HiddenNone"
1211                                                   An argument that is not used
1212                                                   by the kernel. Space needs to
1213                                                   be left for it, but it does
1214                                                   not need to be set up.
1216                                                 "HiddenPrintfBuffer"
1217                                                   A global address space pointer
1218                                                   to the runtime printf buffer
1219                                                   is passed in kernarg.
1221                                                 "HiddenDefaultQueue"
1222                                                   A global address space pointer
1223                                                   to the OpenCL device enqueue
1224                                                   queue that should be used by
1225                                                   the kernel by default is
1226                                                   passed in the kernarg.
1228                                                 "HiddenCompletionAction"
1229                                                   A global address space pointer
1230                                                   to help link enqueued kernels into
1231                                                   the ancestor tree for determining
1232                                                   when the parent kernel has finished.
1234      "ValueType"       string         Required  Kernel argument value type. Only
1235                                                 present if "ValueKind" is
1236                                                 "ByValue". For vector data
1237                                                 types, the value is for the
1238                                                 element type. Values include:
1240                                                 - "Struct"
1241                                                 - "I8"
1242                                                 - "U8"
1243                                                 - "I16"
1244                                                 - "U16"
1245                                                 - "F16"
1246                                                 - "I32"
1247                                                 - "U32"
1248                                                 - "F32"
1249                                                 - "I64"
1250                                                 - "U64"
1251                                                 - "F64"
1253                                                 .. TODO
1254                                                    How can it be determined if a
1255                                                    vector type, and what size
1256                                                    vector?
1257      "PointeeAlign"    integer                  Alignment in bytes of pointee
1258                                                 type for pointer type kernel
1259                                                 argument. Must be a power
1260                                                 of 2. Only present if
1261                                                 "ValueKind" is
1262                                                 "DynamicSharedPointer".
1263      "AddrSpaceQual"   string                   Kernel argument address space
1264                                                 qualifier. Only present if
1265                                                 "ValueKind" is "GlobalBuffer" or
1266                                                 "DynamicSharedPointer". Values
1267                                                 are:
1269                                                 - "Private"
1270                                                 - "Global"
1271                                                 - "Constant"
1272                                                 - "Local"
1273                                                 - "Generic"
1274                                                 - "Region"
1276                                                 .. TODO
1277                                                    Is GlobalBuffer only Global
1278                                                    or Constant? Is
1279                                                    DynamicSharedPointer always
1280                                                    Local? Can HCC allow Generic?
1281                                                    How can Private or Region
1282                                                    ever happen?
1283      "AccQual"         string                   Kernel argument access
1284                                                 qualifier. Only present if
1285                                                 "ValueKind" is "Image" or
1286                                                 "Pipe". Values
1287                                                 are:
1289                                                 - "ReadOnly"
1290                                                 - "WriteOnly"
1291                                                 - "ReadWrite"
1293                                                 .. TODO
1294                                                    Does this apply to
1295                                                    GlobalBuffer?
1296      "ActualAccQual"   string                   The actual memory accesses
1297                                                 performed by the kernel on the
1298                                                 kernel argument. Only present if
1299                                                 "ValueKind" is "GlobalBuffer",
1300                                                 "Image", or "Pipe". This may be
1301                                                 more restrictive than indicated
1302                                                 by "AccQual" to reflect what the
1303                                                 kernel actually does. If not
1304                                                 present then the runtime must
1305                                                 assume what is implied by
1306                                                 "AccQual" and "IsConst". Values
1307                                                 are:
1309                                                 - "ReadOnly"
1310                                                 - "WriteOnly"
1311                                                 - "ReadWrite"
1313      "IsConst"         boolean                  Indicates if the kernel argument
1314                                                 is const qualified. Only present
1315                                                 if "ValueKind" is
1316                                                 "GlobalBuffer".
1318      "IsRestrict"      boolean                  Indicates if the kernel argument
1319                                                 is restrict qualified. Only
1320                                                 present if "ValueKind" is
1321                                                 "GlobalBuffer".
1323      "IsVolatile"      boolean                  Indicates if the kernel argument
1324                                                 is volatile qualified. Only
1325                                                 present if "ValueKind" is
1326                                                 "GlobalBuffer".
1328      "IsPipe"          boolean                  Indicates if the kernel argument
1329                                                 is pipe qualified. Only present
1330                                                 if "ValueKind" is "Pipe".
1332                                                 .. TODO
1333                                                    Can GlobalBuffer be pipe
1334                                                    qualified?
1335      ================= ============== ========= ================================
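
The "Only present if" rules in the table above can be checked mechanically. The
following is a minimal sketch, assuming the argument metadata has already been
parsed into a Python mapping. The names ``ALLOWED_KINDS`` and
``check_kernel_arg`` are hypothetical and not part of any LLVM or ROCm API, and
only the keys shown in this part of the table are covered.

```python
# Hypothetical checker: maps each optional key to the "ValueKind" values for
# which the table says it may be present.
ALLOWED_KINDS = {
    "ValueType": {"ByValue"},
    "PointeeAlign": {"DynamicSharedPointer"},
    "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
    "AccQual": {"Image", "Pipe"},
    "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
    "IsConst": {"GlobalBuffer"},
    "IsRestrict": {"GlobalBuffer"},
    "IsVolatile": {"GlobalBuffer"},
    "IsPipe": {"Pipe"},
}


def is_power_of_two(n):
    """True for positive integers with exactly one bit set."""
    return isinstance(n, int) and n > 0 and (n & (n - 1)) == 0


def check_kernel_arg(arg):
    """Return a list of problems found in one argument metadata mapping."""
    problems = []
    kind = arg.get("ValueKind")
    for key, kinds in ALLOWED_KINDS.items():
        if key in arg and kind not in kinds:
            problems.append(f"{key} is not expected for ValueKind {kind}")
    if "PointeeAlign" in arg and not is_power_of_two(arg["PointeeAlign"]):
        problems.append("PointeeAlign must be a power of 2")
    return problems
```

An empty list means the mapping is consistent with the rules that the sketch
covers; it says nothing about keys defined elsewhere in the metadata.
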
1339   .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1340      :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1342      ============================ ============== ========= =====================
1343      String Key                   Value Type     Required? Description
1344      ============================ ============== ========= =====================
1345      "KernargSegmentSize"         integer        Required  The size in bytes of
1346                                                            the kernarg segment
1347                                                            that holds the values
1348                                                            of the arguments to
1349                                                            the kernel.
1350      "GroupSegmentFixedSize"      integer        Required  The amount of group
1351                                                            segment memory
1352                                                            required by a
1353                                                            work-group in
1354                                                            bytes. This does not
1355                                                            include any
1356                                                            dynamically allocated
1357                                                            group segment memory
1358                                                            that may be added
1359                                                            when the kernel is
1360                                                            dispatched.
1361      "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
1362                                                            private address space
1363                                                            memory required for a
1364                                                            work-item in
1365                                                            bytes. If the kernel
1366                                                            uses a dynamic call
1367                                                            stack then additional
1368                                                            space must be added
1369                                                            to this value for the
1370                                                            call stack.
1371      "KernargSegmentAlign"        integer        Required  The maximum byte
1372                                                            alignment of
1373                                                            arguments in the
1374                                                            kernarg segment. Must
1375                                                            be a power of 2.
1376      "WavefrontSize"              integer        Required  Wavefront size. Must
1377                                                            be a power of 2.
1378      "NumSGPRs"                   integer        Required  Number of scalar
1379                                                            registers used by a
1380                                                            wavefront for
1381                                                            GFX6-GFX9. This
1382                                                            includes the special
1383                                                            SGPRs for VCC, Flat
1384                                                            Scratch (GFX7-GFX9)
1385                                                            and XNACK (for
1386                                                            GFX8-GFX9). It does
1387                                                            not include the 16
1388                                                            SGPR added if a trap
1389                                                            handler is
1390                                                            enabled. It is not
1391                                                            rounded up to the
1392                                                            allocation
1393                                                            granularity.
1394      "NumVGPRs"                   integer        Required  Number of vector
1395                                                            registers used by
1396                                                            each work-item for
1397                                                            GFX6-GFX9.
1398      "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
1399                                                            work-group size
1400                                                            supported by the
1401                                                            kernel in work-items.
1402                                                            Must be >=1 and
1403                                                            consistent with
1404                                                            ReqdWorkGroupSize if
1405                                                            not 0, 0, 0.
1406      "NumSpilledSGPRs"            integer                  Number of stores from
1407                                                            a scalar register to
1408                                                            a register allocator
1409                                                            created spill
1410                                                            location.
1411      "NumSpilledVGPRs"            integer                  Number of stores from
1412                                                            a vector register to
1413                                                            a register allocator
1414                                                            created spill
1415                                                            location.
1416      ============================ ============== ========= =====================
1420 Kernel Dispatch
1421 ~~~~~~~~~~~~~~~
1423 The HSA architected queuing language (AQL) defines a user space memory interface
1424 that can be used to control the dispatch of kernels, in an agent independent
1425 way. An agent can have zero or more AQL queues created for it using the ROCm
1426 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1427 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1428 mechanics and packet layouts.
1430 The packet processor of a kernel agent is responsible for detecting and
1431 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1432 packet processor is implemented by the hardware command processor (CP),
1433 asynchronous dispatch controller (ADC) and shader processor input controller
1434 (SPI).
1436 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1437 mode driver to initialize and register the AQL queue with CP.
1439 To dispatch a kernel the following actions are performed. This can occur in the
1440 CPU host program, or from an HSA kernel executing on a GPU.
1442 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1443    executed is obtained.
1444 2. A pointer to the kernel descriptor (see
1445    :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1446    obtained. It must be for a kernel that is contained in a code object that
1447    was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1448    associated.
1449 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1450    for a memory region with the kernarg property for the kernel agent that will
1451    execute the kernel. It must be at least 16 byte aligned.
1452 4. Kernel argument values are assigned to the kernel argument memory
1453    allocation. The layout is defined in the *HSA Programmer's Language Reference*
1454    [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1455    memory in the same way constant memory is accessed. (Note that the HSA
1456    specification allows an implementation to copy the kernel argument contents to
1457    another location that is accessed by the kernel.)
1458 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1459    API uses 64 bit atomic operations to reserve space in the AQL queue for the
1460    packet. The packet must be set up, and the final write must use an atomic
1461    store release to set the packet kind to ensure the packet contents are
1462    visible to the kernel agent. AQL defines a doorbell signal mechanism to
1463    notify the kernel agent that the AQL queue has been updated. These rules, and
1464    the layout of the AQL queue and kernel dispatch packet are defined in the *HSA
1465    System Architecture Specification* [HSA]_.
1466 6. A kernel dispatch packet includes information about the actual dispatch,
1467    such as grid and work-group size, together with information from the code
1468    object about the kernel, such as segment sizes. The ROCm runtime queries on
1469    the kernel symbol can be used to obtain the code object values which are
1470    recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
1471 7. CP executes micro-code and is responsible for detecting and setting up the
1472    GPU to execute the wavefronts of a kernel dispatch.
1473 8. CP ensures that when a wavefront starts executing the kernel machine
1474    code, the scalar general purpose registers (SGPR) and vector general purpose
1475    registers (VGPR) are set up as required by the machine code. The required
1476    setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1477    register state is defined in
1478    :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1479 9. The prolog of the kernel machine code (see
1480    :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1481    before continuing executing the machine code that corresponds to the kernel.
1482 10. When the kernel dispatch has completed execution, CP signals the completion
1483     signal specified in the kernel dispatch packet if not 0.
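
The ordering requirement in step 5 (reserve a slot, fill in the packet, write
the packet kind last, then ring the doorbell) can be illustrated with a toy
model. This is a sketch only: ``ToyAqlQueue`` and its field names are
hypothetical, plain Python assignments stand in for the atomic operations, and
the real packet layout is the one defined by the HSA specification.

```python
INVALID, KERNEL_DISPATCH = 1, 2  # illustrative packet kinds for this toy model


class ToyAqlQueue:
    """Single-threaded stand-in for an AQL ring buffer (not the ROCm API)."""

    def __init__(self, size):
        assert size > 0 and size & (size - 1) == 0, "size must be a power of 2"
        self.size = size
        self.slots = [{"kind": INVALID} for _ in range(size)]
        self.write_index = 0   # next slot a producer may reserve
        self.read_index = 0    # next slot the packet processor will inspect
        self.doorbell = -1     # highest packet index announced so far

    def reserve(self):
        """Reserve one slot; a real producer uses a 64 bit atomic add."""
        index = self.write_index
        if index - self.read_index >= self.size:
            raise RuntimeError("queue full")
        self.write_index += 1
        return index

    def publish(self, index, body):
        """Fill in the packet, then write its kind as the final store."""
        slot = self.slots[index % self.size]
        slot.update(body)                # packet contents first
        slot["kind"] = KERNEL_DISPATCH   # store-release on real hardware
        self.doorbell = index            # notify the packet processor

    def consume(self):
        """Model the packet processor launching every published packet."""
        launched = []
        while self.read_index <= self.doorbell:
            pos = self.read_index % self.size
            if self.slots[pos]["kind"] != KERNEL_DISPATCH:
                break                    # reserved but not yet published
            launched.append(self.slots[pos]["kernel_object"])
            self.slots[pos] = {"kind": INVALID}
            self.read_index += 1
        return launched
```

Writing the packet kind last is what lets the consumer treat any slot that is
still ``INVALID`` as not ready, even if its index has already been reserved.
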
1485 .. _amdgpu-amdhsa-memory-spaces:
1487 Memory Spaces
1488 ~~~~~~~~~~~~~
1490 The memory space properties are:
1492   .. table:: AMDHSA Memory Spaces
1493      :name: amdgpu-amdhsa-memory-spaces-table
1495      ================= =========== ======== ======= ==================
1496      Memory Space Name HSA Segment Hardware Address NULL Value
1497                        Name        Name     Size
1498      ================= =========== ======== ======= ==================
1499      Private           private     scratch  32      0x00000000
1500      Local             group       LDS      32      0xFFFFFFFF
1501      Global            global      global   64      0x0000000000000000
1502      Constant          constant    *same as 64      0x0000000000000000
1503                                    global*
1504      Generic           flat        flat     64      0x0000000000000000
1505      Region            N/A         GDS      32      *not implemented
1506                                                     for AMDHSA*
1507      ================= =========== ======== ======= ==================
1509 The global and constant memory spaces both use global virtual addresses, which
1510 are the same virtual address space used by the CPU. However, some virtual
1511 addresses may only be accessible to the CPU, some only accessible by the GPU,
1512 and some by both.
1514 Using the constant memory space indicates that the data will not change during
1515 the execution of the kernel. This allows scalar read instructions to be
1516 used. Volatile data in the vector and scalar L1 caches is invalidated before
1517 each kernel dispatch execution to allow constant memory to change values between
1518 kernel dispatches.
1520 The local memory space uses the hardware Local Data Store (LDS) which is
1521 automatically allocated when the hardware creates work-groups of wavefronts, and
1522 freed when all the wavefronts of a work-group have terminated. The data store
1523 (DS) instructions can be used to access it.
1525 The private memory space uses the hardware scratch memory support. If the kernel
1526 uses scratch, then the hardware allocates memory that is accessed using
1527 wavefront lane dword (4 byte) interleaving. The mapping used from private
1528 address to physical address is:
1530   ``wavefront-scratch-base +
1531   (private-address * wavefront-size * 4) +
1532   (wavefront-lane-id * 4)``
1534 There are different ways that the wavefront scratch base address is determined
1535 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1536 memory can be accessed in an interleaved manner using buffer instructions with
1537 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
1538 instructions, or by flat instructions. If each lane of a wavefront accesses the
1539 same private address, the interleaving results in adjacent dwords being accessed
1540 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1541 supported except by flat and scratch instructions in GFX9.
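
The interleaving can be made concrete by transcribing the mapping above
literally. This is a minimal sketch: the function name is hypothetical, the
default wavefront size of 64 is an assumption, and ``private_address`` is used
exactly as it appears in the formula as written.

```python
def scratch_physical_address(scratch_base, private_address, lane_id,
                             wavefront_size=64):
    """Literal transcription of the private-to-physical mapping.

    wavefront_size=64 is an assumed default, not taken from the text above.
    """
    return (scratch_base
            + private_address * wavefront_size * 4
            + lane_id * 4)


# Four lanes reading the same private address land on adjacent dwords.
addresses = [scratch_physical_address(0x1000, 2, lane) for lane in range(4)]
```

With a base of ``0x1000`` and private address 2, the four lanes map to
``0x1200``, ``0x1204``, ``0x1208`` and ``0x120C``, which is why a uniform
private access touches few cache lines.
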
1543 The generic address space uses the hardware flat address support available in
1544 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1545 local apertures), that are outside the range of addressable global memory, to
1546 map from a flat address to a private or local address.
1548 FLAT instructions can take a flat address and access global, private (scratch)
1549 and group (LDS) memory depending on whether the address is within one of the
1550 aperture ranges. Flat access to scratch requires hardware aperture setup and
1551 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1552 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1553 (see :ref:`amdgpu-amdhsa-m0`).
1555 To convert between a segment address and a flat address the base addresses of
1556 the apertures can be used. For GFX7-GFX8 these are available in the
1557 :ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
1558 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1559 GFX9 the aperture base addresses are directly available as inline constant
1560 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1561 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
1562 which makes it easier to convert from flat to segment or segment to flat.
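
Because each aperture is 2^32 bytes and its base is 2^32 aligned, the
conversion reduces to operations on the upper and lower 32 bits. The sketch
below assumes 64 bit address mode; the function names and the example base
value are hypothetical, and the NULL value translation implied by the memory
spaces table is deliberately left out.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64 bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32 bit segment address inside the aperture."""
    assert aperture_base % APERTURE_SIZE == 0, "base must be 2^32 aligned"
    assert 0 <= segment_address < APERTURE_SIZE
    return aperture_base | segment_address


def in_aperture(aperture_base, flat_address):
    """A flat address is in the aperture when the upper 32 bits match."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32 bit segment address from a flat address."""
    if not in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside this aperture")
    return flat_address & (APERTURE_SIZE - 1)


shared_base = 0x2000 << 32  # hypothetical local (LDS) aperture base
flat = segment_to_flat(shared_base, 0x1234)
```

A flat address that falls in neither aperture is treated as a global address,
so ``in_aperture`` is the test that decides which kind of access is made.
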
1564 Image and Samplers
1565 ~~~~~~~~~~~~~~~~~~
1567 Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
1568 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1569 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1570 enumeration values for the queries that are not trivially deducible from the S#
1571 representation.
1573 HSA Signals
1574 ~~~~~~~~~~~
1576 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1577 structure allocated in memory accessible from both the CPU and GPU. The
1578 structure is defined by the ROCm runtime and subject to change between releases
1579 (see [AMD-ROCm-github]_).
1581 .. _amdgpu-amdhsa-hsa-aql-queue:
1583 HSA AQL Queue
1584 ~~~~~~~~~~~~~
1586 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1587 between releases (see [AMD-ROCm-github]_). For some processors it contains
1588 fields needed to implement certain language features such as the flat address
1589 aperture bases. It also contains fields used by CP such as managing the
1590 allocation of scratch memory.
1592 .. _amdgpu-amdhsa-kernel-descriptor:
1594 Kernel Descriptor
1595 ~~~~~~~~~~~~~~~~~
1597 A kernel descriptor consists of the information needed by CP to initiate the
1598 execution of a kernel, including the entry point address of the machine code
1599 that implements the kernel.
1601 Kernel Descriptor for GFX6-GFX9
1602 +++++++++++++++++++++++++++++++
1604 CP microcode requires the kernel descriptor to be allocated on 64 byte
1605 alignment.
1607   .. table:: Kernel Descriptor for GFX6-GFX9
1608      :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1610      ======= ======= =============================== ============================
1611      Bits    Size    Field Name                      Description
1612      ======= ======= =============================== ============================
1613      31:0    4 bytes GROUP_SEGMENT_FIXED_SIZE        The amount of fixed local
1614                                                      address space memory
1615                                                      required for a work-group
1616                                                      in bytes. This does not
1617                                                      include any dynamically
1618                                                      allocated local address
1619                                                      space memory that may be
1620                                                      added when the kernel is
1621                                                      dispatched.
1622      63:32   4 bytes PRIVATE_SEGMENT_FIXED_SIZE      The amount of fixed
1623                                                      private address space
1624                                                      memory required for a
1625                                                      work-item in bytes. If
1626                                                      is_dynamic_callstack is 1
1627                                                      then additional space must
1628                                                      be added to this value for
1629                                                      the call stack.
1630      127:64  8 bytes                                 Reserved, must be 0.
1631      191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET   Byte offset (possibly
1632                                                      negative) from base
1633                                                      address of kernel
1634                                                      descriptor to kernel's
1635                                                      entry point instruction
1636                                                      which must be 256 byte
1637                                                      aligned.
1638      383:192 24                                      Reserved, must be 0.
1639              bytes
1640      415:384 4 bytes COMPUTE_PGM_RSRC1               Compute Shader (CS)
1641                                                      program settings used by
1642                                                      CP to set up
1643                                                      ``COMPUTE_PGM_RSRC1``
1644                                                      configuration
1645                                                      register. See
1646                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1647      447:416 4 bytes COMPUTE_PGM_RSRC2               Compute Shader (CS)
1648                                                      program settings used by
1649                                                      CP to set up
1650                                                      ``COMPUTE_PGM_RSRC2``
1651                                                      configuration
1652                                                      register. See
1653                                                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1654      448     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1655                      _BUFFER                         SGPR user data registers
1656                                                      (see
1657                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1659                                                      The number of SGPR user data
1660                                                      registers requested must not
1661                                                      exceed 16 and must match the
1662                                                      value in
1663                                                      ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1664                                                      Any requests beyond 16
1665                                                      will be ignored.
1666      449     1 bit   ENABLE_SGPR_DISPATCH_PTR        *see above*
1667      450     1 bit   ENABLE_SGPR_QUEUE_PTR           *see above*
1668      451     1 bit   ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
1669      452     1 bit   ENABLE_SGPR_DISPATCH_ID         *see above*
1670      453     1 bit   ENABLE_SGPR_FLAT_SCRATCH_INIT   *see above*
1671      454     1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     *see above*
1672                      _SIZE
1673      455     1 bit                                   Reserved, must be 0.
1674      511:456 7 bytes                                 Reserved, must be 0.
1675      512     **Total size 64 bytes.**
1676      ======= ====================================================================
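
The bit ranges in the table above can be cross-checked by packing a descriptor
with Python's ``struct`` module. This is a sketch under stated assumptions:
little-endian byte order, with bit 448 as the least significant bit of the byte
at offset 56. The function and dictionary names are hypothetical and are not
part of LLVM, and the register values in any call are placeholders.

```python
import struct

# Bit position of each ENABLE_SGPR_* flag within the byte that holds
# descriptor bits 455:448 (names shortened for this sketch).
ENABLE_SGPR_BITS = {
    "private_segment_buffer": 0,
    "dispatch_ptr": 1,
    "queue_ptr": 2,
    "kernarg_segment_ptr": 3,
    "dispatch_id": 4,
    "flat_scratch_init": 5,
    "private_segment_size": 6,
}


def pack_kernel_descriptor(group_size, private_size, entry_offset,
                           rsrc1, rsrc2, enabled=()):
    """Pack the GFX6-GFX9 kernel descriptor fields into 64 bytes."""
    flags = 0
    for name in enabled:
        flags |= 1 << ENABLE_SGPR_BITS[name]
    # I: GROUP_SEGMENT_FIXED_SIZE, I: PRIVATE_SEGMENT_FIXED_SIZE,
    # 8x: reserved, q: signed KERNEL_CODE_ENTRY_BYTE_OFFSET, 24x: reserved,
    # I: COMPUTE_PGM_RSRC1, I: COMPUTE_PGM_RSRC2,
    # B: enable flags and reserved bit 455, 7x: reserved bits 511:456.
    return struct.pack("<II8xq24xIIB7x", group_size, private_size,
                       entry_offset, rsrc1, rsrc2, flags)
```

The format string adds up to 4 + 4 + 8 + 8 + 24 + 4 + 4 + 1 + 7 = 64 bytes,
matching the total size given in the table.
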
1680   .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1681      :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1683      ======= ======= =============================== ===========================================================================
1684      Bits    Size    Field Name                      Description
1685      ======= ======= =============================== ===========================================================================
1686      5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector register
1687                                                      blocks used by each work-item;
1688                                                      granularity is device
1689                                                      specific:
1691                                                      GFX6-GFX9
1692                                                        - vgprs_used 0..256
1693                                                        - max(0, ceil(vgprs_used / 4) - 1)
1695                                                      Where vgprs_used is defined
1696                                                      as the highest VGPR number
1697                                                      explicitly referenced plus
1698                                                      one.
1700                                                      Used by CP to set up
1701                                                      ``COMPUTE_PGM_RSRC1.VGPRS``.
1703                                                      The
1704                                                      :ref:`amdgpu-assembler`
1705                                                      calculates this
1706                                                      automatically for the
1707                                                      selected processor from
1708                                                      values provided to the
1709                                                      `.amdhsa_kernel` directive
1710                                                      by the
1711                                                      `.amdhsa_next_free_vgpr`
1712                                                      nested directive (see
1713                                                      :ref:`amdhsa-kernel-directives-table`).
1714      9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
1715                                                      blocks used by a wavefront;
1716                                                      granularity is device
1717                                                      specific:
1719                                                      GFX6-GFX8
1720                                                        - sgprs_used 0..112
1721                                                        - max(0, ceil(sgprs_used / 8) - 1)
1722                                                      GFX9
1723                                                        - sgprs_used 0..112
1724                                                        - 2 * max(0, ceil(sgprs_used / 16) - 1)
1726                                                      Where sgprs_used is
1727                                                      defined as the highest
1728                                                      SGPR number explicitly
1729                                                      referenced plus one, plus
1730                                                      a target-specific number
1731                                                      of additional special
1732                                                      SGPRs for VCC,
1733                                                      FLAT_SCRATCH (GFX7+) and
1734                                                      XNACK_MASK (GFX8+), and
1735                                                      any additional
1736                                                      target-specific
1737                                                      limitations. It does not
1738                                                      include the 16 SGPRs added
1739                                                      if a trap handler is
1740                                                      enabled.
1742                                                      The target-specific
1743                                                      limitations and special
1744                                                      SGPR layout are defined in
1745                                                      the hardware
1746                                                      documentation, which can
1747                                                      be found in the
1748                                                      :ref:`amdgpu-processors`
1749                                                      table.
1751                                                      Used by CP to set up
1752                                                      ``COMPUTE_PGM_RSRC1.SGPRS``.
1754                                                      The
1755                                                      :ref:`amdgpu-assembler`
1756                                                      calculates this
1757                                                      automatically for the
1758                                                      selected processor from
1759                                                      values provided to the
1760                                                      `.amdhsa_kernel` directive
1761                                                      by the
1762                                                      `.amdhsa_next_free_sgpr`
1763                                                      and `.amdhsa_reserve_*`
1764                                                      nested directives (see
1765                                                      :ref:`amdhsa-kernel-directives-table`).
1766      11:10   2 bits  PRIORITY                        Must be 0.
1768                                                      Start executing wavefront
1769                                                      at the specified priority.
1771                                                      CP is responsible for
1772                                                      filling in
1773                                                      ``COMPUTE_PGM_RSRC1.PRIORITY``.
1774      13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
1775                                                      with specified rounding
1776                                                      mode for single (32
1777                                                      bit) precision
1778                                                      floating point
1779                                                      operations.
1781                                                      Floating point rounding
1782                                                      mode values are defined in
1783                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1785                                                      Used by CP to set up
1786                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1787      15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
1788                                                      with specified rounding
1789                                                      mode for half/double (16
1790                                                      and 64 bit) precision
1791                                                      floating point
1792                                                      operations.
1794                                                      Floating point rounding
1795                                                      mode values are defined in
1796                                                      :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1798                                                      Used by CP to set up
1799                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1800      17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
1801                                                      with specified denorm mode
1802                                                      for single (32
1803                                                      bit) precision
1804                                                      floating point
1805                                                      operations.
1807                                                      Floating point denorm mode
1808                                                      values are defined in
1809                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1811                                                      Used by CP to set up
1812                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1813      19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
1814                                                      with specified denorm mode
1815                                                      for half/double (16
1816                                                      and 64 bit) precision
1817                                                      floating point
1818                                                      operations.
1820                                                      Floating point denorm mode
1821                                                      values are defined in
1822                                                      :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1824                                                      Used by CP to set up
1825                                                      ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1826      20      1 bit   PRIV                            Must be 0.
1828                                                      Start executing wavefront
1829                                                      in privilege trap handler
1830                                                      mode.
1832                                                      CP is responsible for
1833                                                      filling in
1834                                                      ``COMPUTE_PGM_RSRC1.PRIV``.
1835      21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
1836                                                      with DX10 clamp mode
1837                                                      enabled. Used by the vector
1838                                                      ALU to force DX10 style
1839                                                      treatment of NaNs (when
1840                                                      set, clamp NaN to zero,
1841                                                      otherwise pass NaN
1842                                                      through).
1844                                                      Used by CP to set up
1845                                                      ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1846      22      1 bit   DEBUG_MODE                      Must be 0.
1848                                                      Start executing wavefront
1849                                                      in single step mode.
1851                                                      CP is responsible for
1852                                                      filling in
1853                                                      ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1854      23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
1855                                                      with IEEE mode
1856                                                      enabled. Floating point
1857                                                      opcodes that support
1858                                                      exception flag gathering
1859                                                      will quiet and propagate
1860                                                      signaling-NaN inputs per
1861                                                      IEEE 754-2008. Min_dx10 and
1862                                                      max_dx10 become IEEE
1863                                                      754-2008 compliant due to
1864                                                      signaling-NaN propagation
1865                                                      and quieting.
1867                                                      Used by CP to set up
1868                                                      ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1869      24      1 bit   BULKY                           Must be 0.
1871                                                      Only one work-group allowed
1872                                                      to execute on a compute
1873                                                      unit.
1875                                                      CP is responsible for
1876                                                      filling in
1877                                                      ``COMPUTE_PGM_RSRC1.BULKY``.
1878      25      1 bit   CDBG_USER                       Must be 0.
1880                                                      Flag that can be used to
1881                                                      control debugging code.
1883                                                      CP is responsible for
1884                                                      filling in
1885                                                      ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1886      26      1 bit   FP16_OVFL                       GFX6-GFX8
1887                                                        Reserved, must be 0.
1888                                                      GFX9
1889                                                        Wavefront starts execution
1890                                                        with specified fp16 overflow
1891                                                        mode.
1893                                                        - If 0, fp16 overflow generates
1894                                                          +/-INF values.
1895                                                        - If 1, fp16 overflow that is the
1896                                                          result of a +/-INF input value
1897                                                          or divide by 0 produces a +/-INF,
1898                                                          otherwise clamps computed
1899                                                          overflow to +/-MAX_FP16 as
1900                                                          appropriate.
1902                                                        Used by CP to set up
1903                                                        ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1904      31:27   5 bits                                  Reserved, must be 0.
1905      32      **Total size 4 bytes.**
1906      ======= ===================================================================================================================
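
The SGPR granule formulas and the bit positions listed in the table above can
be illustrated with a small sketch. It is not the assembler's implementation:
the names ``granulated_sgpr_count``, ``RSRC1_FIELDS`` and ``pack_rsrc1`` are
hypothetical, ``sgprs_used`` is assumed to already include the special SGPRs
described above, and only fields whose bit ranges appear in this part of the
table are packed.

.. code-block:: python

   def granulated_sgpr_count(sgprs_used, gfx_major):
       """GRANULATED_WAVEFRONT_SGPR_COUNT per the table formulas (GFX6-GFX9)."""
       if not 0 <= sgprs_used <= 112:
           raise ValueError("sgprs_used must be in 0..112")
       if 6 <= gfx_major <= 8:
           # max(0, ceil(sgprs_used / 8) - 1), using integer arithmetic.
           return max(0, (sgprs_used + 7) // 8 - 1)
       if gfx_major == 9:
           # 2 * max(0, ceil(sgprs_used / 16) - 1)
           return 2 * max(0, (sgprs_used + 15) // 16 - 1)
       raise ValueError("only GFX6-GFX9 are covered by this table")

   # (low bit, width) pairs copied from the Bits and Size columns.
   # The dictionary keys are illustrative names, not assembler directives.
   RSRC1_FIELDS = {
       "granulated_wavefront_sgpr_count": (6, 4),
       "float_round_mode_32": (12, 2),
       "float_round_mode_16_64": (14, 2),
       "float_denorm_mode_32": (16, 2),
       "float_denorm_mode_16_64": (18, 2),
       "enable_dx10_clamp": (21, 1),
       "enable_ieee_mode": (23, 1),
       "fp16_ovfl": (26, 1),
   }

   def pack_rsrc1(**fields):
       """Pack the given field values into a 32 bit word; other bits stay 0."""
       word = 0
       for name, value in fields.items():
           low, width = RSRC1_FIELDS[name]
           if not 0 <= value < (1 << width):
               raise ValueError(f"{name}={value} does not fit in {width} bits")
           word |= value << low
       return word

For example, ``granulated_sgpr_count(17, 9)`` gives 2: 17 SGPRs round up to two
blocks of 16, and ``2 * (2 - 1)`` is 2. Fields documented as "Must be 0" are
left out of ``RSRC1_FIELDS`` on purpose, since CP fills them in.
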
1910   .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1911      :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1913      ======= ======= =============================== ===========================================================================
1914      Bits    Size    Field Name                      Description
1915      ======= ======= =============================== ===========================================================================
1916      0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
1917                      _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
1918                                                      system register (see
1919                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1921                                                      Used by CP to set up
1922                                                      ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1923      5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
1924                                                      user data registers
1925                                                      requested. This number must
1926                                                      match the number of user
1927                                                      data registers enabled.
1929                                                      Used by CP to set up
1930                                                      ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1931      6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.
1933                                                      This bit represents
1934                                                      ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
1935                                                      which is set by the CP if
1936                                                      the runtime has installed a
1937                                                      trap handler.
1938      7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
1939                                                      system SGPR register for
1940                                                      the work-group id in the X
1941                                                      dimension (see
1942                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1944                                                      Used by CP to set up
1945                                                      ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1946      8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
1947                                                      system SGPR register for
1948                                                      the work-group id in the Y
1949                                                      dimension (see
1950                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1952                                                      Used by CP to set up
1953                                                      ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1954      9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
1955                                                      system SGPR register for
1956                                                      the work-group id in the Z
1957                                                      dimension (see
1958                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1960                                                      Used by CP to set up
1961                                                      ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1962      10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
1963                                                      system SGPR register for
1964                                                      work-group information (see
1965                                                      :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1967                                                      Used by CP to set up
1968                                                      ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1969      12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
1970                                                      VGPR system registers used
1971                                                      for the work-item ID.
1972                                                      :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1973                                                      defines the values.
1975                                                      Used by CP to set up
1976                                                      ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1977      13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.
1979                                                      Wavefront starts execution
1980                                                      with address watch
1981                                                      exceptions enabled which
1982                                                      are generated when L1 has
1983                                                      witnessed a thread access
1984                                                      an *address of
1985                                                      interest*.
1987                                                      CP is responsible for
1988                                                      filling in the address
1989                                                      watch bit in
1990                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1991                                                      according to what the
1992                                                      runtime requests.
1993      14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.
1995                                                      Wavefront starts execution
1996                                                      with memory violation
1997                                                      exceptions
1998                                                      enabled which are generated
1999                                                      when a memory violation has
2000                                                      occurred for this wavefront from
2001                                                      L1 or LDS
2002                                                      (write-to-read-only-memory,
2003                                                      mis-aligned atomic, LDS
2004                                                      address out of range,
2005                                                      illegal address, etc.).
2007                                                      CP sets the memory
2008                                                      violation bit in
2009                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2010                                                      according to what the
2011                                                      runtime requests.
2012      23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.
2014                                                      CP uses the rounded value
2015                                                      from the dispatch packet,
2016                                                      not this value, as the
2017                                                      dispatch may contain
2018                                                      dynamically allocated group
2019                                                      segment memory. CP writes
2020                                                      directly to
2021                                                      ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
2023                                                      Amount of group segment
2024                                                      (LDS) to allocate for each
2025                                                      work-group. Granularity is
2026                                                      device specific:
2028                                                      GFX6:
2029                                                        roundup(lds-size / (64 * 4))
2030                                                      GFX7-GFX9:
2031                                                        roundup(lds-size / (128 * 4))
2033      24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
2034                      _INVALID_OPERATION              with specified exceptions
2035                                                      enabled.
2037                                                      Used by CP to set up
2038                                                      ``COMPUTE_PGM_RSRC2.EXCP_EN``
2039                                                      (set from bits 0..6).
2041                                                      IEEE 754 FP Invalid
2042                                                      Operation
2043      25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
2044                      _SOURCE                         input operands is a
2045                                                      denormal number
2046      26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
2047                      _DIVISION_BY_ZERO               Zero
2048      27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
2049                      _OVERFLOW
2050      28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
2051                      _UNDERFLOW
2052      29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
2053                      _INEXACT
2054      30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
2055                      _ZERO                           (rcp_iflag_f32 instruction
2056                                                      only)
2057      31      1 bit                                   Reserved, must be 0.
2058      32      **Total size 4 bytes.**
2059      ======= ===================================================================================================================
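
The layout of ``compute_pgm_rsrc2`` above can be sketched as a decoder. This is
an illustration only: ``RSRC2_FIELDS``, ``decode_rsrc2`` and
``granulated_lds_size`` are hypothetical names, the seven exception enable bits
(24 to 30) are grouped into one ``exception_enables`` value for brevity, and
``lds_size_bytes`` assumes that *lds-size* in the table is a byte count.

.. code-block:: python

   # (low bit, width) pairs copied from the Bits and Size columns.
   RSRC2_FIELDS = {
       "enable_sgpr_private_segment_wavefront_offset": (0, 1),
       "user_sgpr_count": (1, 5),
       "enable_trap_handler": (6, 1),
       "enable_sgpr_workgroup_id_x": (7, 1),
       "enable_sgpr_workgroup_id_y": (8, 1),
       "enable_sgpr_workgroup_id_z": (9, 1),
       "enable_sgpr_workgroup_info": (10, 1),
       "enable_vgpr_workitem_id": (11, 2),
       "enable_exception_address_watch": (13, 1),
       "enable_exception_memory": (14, 1),
       "granulated_lds_size": (15, 9),
       "exception_enables": (24, 7),  # bits 24..30, grouped for this sketch
   }

   def decode_rsrc2(word):
       """Split a 32 bit compute_pgm_rsrc2 value into its fields."""
       if not 0 <= word < (1 << 31):
           raise ValueError("bit 31 is reserved and must be 0")
       return {
           name: (word >> low) & ((1 << width) - 1)
           for name, (low, width) in RSRC2_FIELDS.items()
       }

   def granulated_lds_size(lds_size_bytes, gfx_major):
       """roundup(lds-size / granule) with the per-generation granule."""
       if gfx_major == 6:
           granule = 64 * 4
       elif 7 <= gfx_major <= 9:
           granule = 128 * 4
       else:
           raise ValueError("only GFX6-GFX9 are covered by this table")
       return (lds_size_bytes + granule - 1) // granule

Note that ``GRANULATED_LDS_SIZE`` must be 0 in the kernel descriptor itself;
``granulated_lds_size`` only shows the rounding that the table describes for
the value CP derives from the dispatch packet.
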
2063   .. table:: Floating Point Rounding Mode Enumeration Values
2064      :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
2066      ====================================== ===== ==============================
2067      Enumeration Name                       Value Description
2068      ====================================== ===== ==============================
2069      FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
2070      FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
2071      FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
2072      FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
2073      ====================================== ===== ==============================
2077   .. table:: Floating Point Denorm Mode Enumeration Values
2078      :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2080      ====================================== ===== ==============================
2081      Enumeration Name                       Value Description
2082      ====================================== ===== ==============================
2083      FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
2084                                                   Denorms
2085      FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
2086      FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
2087      FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
2088      ====================================== ===== ==============================
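
The two enumerations above feed the four 2 bit mode fields of
``compute_pgm_rsrc1`` (bits 13:12, 15:14, 17:16 and 19:18). A minimal sketch of
how they combine follows; the class names ``FloatRoundMode`` and
``FloatDenormMode`` and the helper ``float_mode_bits`` are hypothetical and
chosen only for this example.

.. code-block:: python

   from enum import IntEnum

   class FloatRoundMode(IntEnum):
       NEAR_EVEN = 0       # Round Ties To Even
       PLUS_INFINITY = 1   # Round Toward +infinity
       MINUS_INFINITY = 2  # Round Toward -infinity
       ZERO = 3            # Round Toward 0

   class FloatDenormMode(IntEnum):
       FLUSH_SRC_DST = 0   # Flush Source and Destination Denorms
       FLUSH_DST = 1       # Flush Output Denorms
       FLUSH_SRC = 2       # Flush Source Denorms
       FLUSH_NONE = 3      # No Flush

   def float_mode_bits(round_32, round_16_64, denorm_32, denorm_16_64):
       """Place the four modes at their compute_pgm_rsrc1 bit positions."""
       # Converting through the enum rejects values outside 0..3.
       return (
           FloatRoundMode(round_32) << 12
           | FloatRoundMode(round_16_64) << 14
           | FloatDenormMode(denorm_32) << 16
           | FloatDenormMode(denorm_16_64) << 18
       )

For instance, round-to-nearest-even for all precisions, flushing 32 bit
denormals and keeping 16 and 64 bit denormals, is
``float_mode_bits(0, 0, FloatDenormMode.FLUSH_SRC_DST, FloatDenormMode.FLUSH_NONE)``.
This particular combination is only an example, not a documented default.
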
2092   .. table:: System VGPR Work-Item ID Enumeration Values
2093      :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2095      ======================================== ===== ============================
2096      Enumeration Name                         Value Description
2097      ======================================== ===== ============================
2098      SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
2099                                                     ID.
2100      SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
2101                                                     dimensions ID.
2102      SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
2103                                                     dimensions ID.
2104      SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
2105      ======================================== ===== ============================
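
Read as a rule, the table above says that each defined value enables one more
work-item ID dimension than the previous one. A small sketch, using the
hypothetical helper name ``workitem_id_dimensions``:

.. code-block:: python

   def workitem_id_dimensions(enable_vgpr_workitem_id):
       """Return the work-item ID dimensions set up for a given field value."""
       if enable_vgpr_workitem_id not in (0, 1, 2):
           # Value 3 is listed as undefined; other values do not fit in 2 bits.
           raise ValueError("ENABLE_VGPR_WORKITEM_ID must be 0, 1 or 2")
       return ("X", "Y", "Z")[: enable_vgpr_workitem_id + 1]
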
2107 .. _amdgpu-amdhsa-initial-kernel-execution-state:
2109 Initial Kernel Execution State
2110 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2112 This section defines the register state that will be set up by the packet
2113 processor prior to the start of execution of every wavefront. This is limited by
2114 the constraints of the hardware controllers of CP/ADC/SPI.
2116 The order of the SGPR registers is defined, but the compiler can specify which
2117 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
2118 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2119 for enabled registers are dense starting at SGPR0: the first enabled register is
2120 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2121 an SGPR number.
2123 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
2124 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2125 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2126 initialized. These are then immediately followed by the System SGPRs that are
2127 set up by ADC/SPI and can have different values for each wavefront of the grid
2128 dispatch.
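
The dense numbering rule can be sketched as follows. The helper
``assign_user_sgprs`` and the list ``USER_SGPR_ORDER`` are hypothetical; the
list holds only the first six entries of
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`, with their enable field
names and SGPR counts, so it is incomplete by design.

.. code-block:: python

   # (kernel descriptor enable field, number of SGPRs), in set-up order.
   USER_SGPR_ORDER = [
       ("enable_sgpr_private_segment_buffer", 4),
       ("enable_sgpr_dispatch_ptr", 2),
       ("enable_sgpr_queue_ptr", 2),
       ("enable_sgpr_kernarg_segment_ptr", 2),
       ("enable_sgpr_dispatch_id", 2),
       ("enable_sgpr_flat_scratch_init", 2),
   ]

   # Only the first 16 User SGPRs are actually initialized.
   MAX_INITIALIZED_USER_SGPRS = 16

   def assign_user_sgprs(enabled):
       """Map each enabled field to its SGPR numbers, dense from SGPR0.

       Returns the mapping and the number of User SGPRs that get initialized.
       """
       assignment = {}
       next_sgpr = 0
       for name, count in USER_SGPR_ORDER:
           if name in enabled:
               assignment[name] = list(range(next_sgpr, next_sgpr + count))
               next_sgpr += count
           # Disabled registers are skipped and consume no SGPR number.
       return assignment, min(next_sgpr, MAX_INITIALIZED_USER_SGPRS)

With only the dispatch pointer and the kernarg segment pointer enabled, the
dispatch pointer occupies SGPR0-SGPR1 and the kernarg segment pointer
SGPR2-SGPR3. The six entries listed here total at most 14 SGPRs, so the cap of
16 only matters once further entries of the full table are added.
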
2130 SGPR register initial state is defined in
2131 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2133   .. table:: SGPR Register Set Up Order
2134      :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2136      ========== ========================== ====== ==============================
2137      SGPR Order Name                       Number Description
2138                 (kernel descriptor enable  of
2139                 field)                     SGPRs
2140      ========== ========================== ====== ==============================
2141      First      Private Segment Buffer     4      V# that can be used, together
2142                 (enable_sgpr_private              with Scratch Wavefront Offset
2143                 _segment_buffer)                  as an offset, to access the
2144                                                   private memory space using a
2145                                                   segment address.
2147                                                   CP uses the value provided by
2148                                                   the runtime.
2149      then       Dispatch Ptr               2      64 bit address of AQL dispatch
2150                 (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
2151                                                   actually executing.
2152      then       Queue Ptr                  2      64 bit address of amd_queue_t
2153                 (enable_sgpr_queue_ptr)           object for AQL queue on which
2154                                                   the dispatch packet was
2155                                                   queued.
2156      then       Kernarg Segment Ptr        2      64 bit address of Kernarg
2157                 (enable_sgpr_kernarg              segment. This is directly
2158                 _segment_ptr)                     copied from the
2159                                                   kernarg_address in the kernel
2160                                                   dispatch packet.
2162                                                   Having CP load it once avoids
2163                                                   loading it at the beginning of
2164                                                   every wavefront.
2165      then       Dispatch Id                2      64 bit Dispatch ID of the
2166                 (enable_sgpr_dispatch_id)         dispatch packet being
2167                                                   executed.
2168      then       Flat Scratch Init          2      This is 2 SGPRs:
2169                 (enable_sgpr_flat_scratch
2170                 _init)                            GFX6
2171                                                     Not supported.
2172                                                   GFX7-GFX8
2173                                                     The first SGPR is a 32 bit
2174                                                     byte offset from
2175                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2176                                                     to per SPI base of memory
2177                                                     for scratch for the queue
2178                                                     executing the kernel
2179                                                     dispatch. CP obtains this
2180                                                     from the runtime. (The
2181                                                     Scratch Segment Buffer base
2182                                                     address is
2183                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2184                                                     plus this offset.) The value
2185                                                     of Scratch Wavefront Offset must
2186                                                     be added to this offset by
2187                                                     the kernel machine code,
2188                                                     right shifted by 8, and
2189                                                     moved to the FLAT_SCRATCH_HI
2190                                                     SGPR register.
2191                                                     FLAT_SCRATCH_HI corresponds
2192                                                     to SGPRn-4 on GFX7, and
2193                                                     SGPRn-6 on GFX8 (where SGPRn
2194                                                     is the highest numbered SGPR
2195                                                     allocated to the wavefront).
2196                                                     FLAT_SCRATCH_HI is
2197                                                     multiplied by 256 (as it is
2198                                                     in units of 256 bytes) and
2199                                                     added to
2200                                                     ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2201                                                     to calculate the per wavefront
2202                                                     FLAT SCRATCH BASE in flat
2203                                                     memory instructions that
2204                                                     access the scratch
2205                                                     aperture.
2207                                                     The second SGPR is the 32 bit
2208                                                     byte size of a single
2209                                                     work-item's scratch memory
2210                                                     usage. CP obtains this from
2211                                                     the runtime, and it is
2212                                                     always a multiple of DWORD.
2213                                                     CP checks that the value in
2214                                                     the kernel dispatch packet
2215                                                     Private Segment Byte Size is
2216                                                     not larger, and requests the
2217                                                     runtime to increase the
2218                                                     queue's scratch size if
2219                                                     necessary. The kernel code
2220                                                     must move it to
2221                                                     FLAT_SCRATCH_LO which is
2222                                                     SGPRn-3 on GFX7 and SGPRn-5
2223                                                     on GFX8. FLAT_SCRATCH_LO is
2224                                                     used as the FLAT SCRATCH
2225                                                     SIZE in flat memory
2226                                                     instructions. Having CP load
2227                                                     it once avoids loading it at
2228                                                     the beginning of every
2229                                                     wavefront.
2230                                                   GFX9
2231                                                     This is the
2232                                                     64 bit base address of the
2233                                                     per SPI scratch backing
2234                                                     memory managed by SPI for
2235                                                     the queue executing the
2236                                                     kernel dispatch. CP obtains
2237                                                     this from the runtime (and
2238                                                     divides it if there are
2239                                                     multiple Shader Arrays each
2240                                                     with its own SPI). The value
2241                                                     of Scratch Wavefront Offset must
2242                                                     be added by the kernel
2243                                                     machine code and the result
2244                                                     moved to the FLAT_SCRATCH
2245                                                     SGPR which is SGPRn-6 and
2246                                                     SGPRn-5. It is used as the
2247                                                     FLAT SCRATCH BASE in flat
2248                                                     memory instructions.
2249      then       Private Segment Size       1      The 32 bit byte size of a
2250                 (enable_sgpr_private              single
2251                 _segment_size)                    work-item's
2252                                                   scratch memory
2253                                                   allocation. This is the
2254                                                   value from the kernel
2255                                                   dispatch packet Private
2256                                                   Segment Byte Size rounded up
2257                                                   by CP to a multiple of
2258                                                   DWORD.
2260                                                   Having CP load it once avoids
2261                                                   loading it at the beginning of
2262                                                   every wavefront.
2264                                                   This is not used for
2265                                                   GFX7-GFX8 since it is the same
2266                                                   value as the second SGPR of
2267                                                   Flat Scratch Init. However, it
2268                                                   may be needed for GFX9 which
2269                                                   changes the meaning of the
2270                                                   Flat Scratch Init value.
2271      then       Grid Work-Group Count X    1      32 bit count of the number of
2272                 (enable_sgpr_grid                 work-groups in the X dimension
2273                 _workgroup_count_X)               for the grid being
2274                                                   executed. Computed from the
2275                                                   fields in the kernel dispatch
2276                                                   packet as ((grid_size.x +
2277                                                   workgroup_size.x - 1) /
2278                                                   workgroup_size.x).
2279      then       Grid Work-Group Count Y    1      32 bit count of the number of
2280                 (enable_sgpr_grid                 work-groups in the Y dimension
2281                 _workgroup_count_Y &&             for the grid being
2282                 less than 16 previous             executed. Computed from the
2283                 SGPRs)                            fields in the kernel dispatch
2284                                                   packet as ((grid_size.y +
2285                                                   workgroup_size.y - 1) /
2286                                                   workgroup_size.y).
2288                                                   Only initialized if <16
2289                                                   previous SGPRs initialized.
2290      then       Grid Work-Group Count Z    1      32 bit count of the number of
2291                 (enable_sgpr_grid                 work-groups in the Z dimension
2292                 _workgroup_count_Z &&             for the grid being
2293                 less than 16 previous             executed. Computed from the
2294                 SGPRs)                            fields in the kernel dispatch
2295                                                   packet as ((grid_size.z +
2296                                                   workgroup_size.z - 1) /
2297                                                   workgroup_size.z).
2299                                                   Only initialized if <16
2300                                                   previous SGPRs initialized.
2301      then       Work-Group Id X            1      32 bit work-group id in X
2302                 (enable_sgpr_workgroup_id         dimension of grid for
2303                 _X)                               wavefront.
2304      then       Work-Group Id Y            1      32 bit work-group id in Y
2305                 (enable_sgpr_workgroup_id         dimension of grid for
2306                 _Y)                               wavefront.
2307      then       Work-Group Id Z            1      32 bit work-group id in Z
2308                 (enable_sgpr_workgroup_id         dimension of grid for
2309                 _Z)                               wavefront.
2310      then       Work-Group Info            1      {first_wavefront, 14'b0000,
2311                 (enable_sgpr_workgroup            ordered_append_term[10:0],
2312                 _info)                            threadgroup_size_in_wavefronts[5:0]}
2313      then       Scratch Wavefront Offset   1      32 bit byte offset from base
2314                 (enable_sgpr_private              of scratch base of queue
2315                 _segment_wavefront_offset)        executing the kernel
2316                                                   dispatch. Must be used as an
2317                                                   offset with Private
2318                                                   segment address when using
2319                                                   Scratch Segment Buffer. It
2320                                                   must be used to set up FLAT
2321                                                   SCRATCH for flat addressing
2322                                                   (see
2323                                                   :ref:`amdgpu-amdhsa-flat-scratch`).
2324      ========== ========================== ====== ==============================
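The Grid Work-Group Count entries in the table above are a ceiling division of the grid size by the work-group size. A minimal sketch of that arithmetic follows; the function names are hypothetical and only model the formula, not how CP actually computes or loads the values.

```python
def grid_workgroup_count(grid_size: int, workgroup_size: int) -> int:
    """Work-groups needed to cover grid_size work-items in one dimension."""
    if grid_size < 0 or workgroup_size <= 0:
        raise ValueError("grid_size must be >= 0 and workgroup_size must be > 0")
    # Integer ceiling division: a partially filled trailing work-group
    # still counts as one work-group.
    return (grid_size + workgroup_size - 1) // workgroup_size


def grid_workgroup_counts(grid_size, workgroup_size):
    """Apply the same formula to the X, Y and Z dimensions."""
    return tuple(
        grid_workgroup_count(g, w) for g, w in zip(grid_size, workgroup_size)
    )
```

For example, a grid of 1000 work-items in X with a work-group size of 256 needs four work-groups, the last of which is only partially filled.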
2326 The order of the VGPR registers is defined, but the compiler can specify which
2327 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2328 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2329 for enabled registers are dense starting at VGPR0: the first enabled register is
2330 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2331 VGPR number.
2333 VGPR register initial state is defined in
2334 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2336   .. table:: VGPR Register Set Up Order
2337      :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2339      ========== ========================== ====== ==============================
2340      VGPR Order Name                       Number Description
2341                 (kernel descriptor enable  of
2342                 field)                     VGPRs
2343      ========== ========================== ====== ==============================
2344      First      Work-Item Id X             1      32 bit work item id in X
2345                 (Always initialized)              dimension of work-group for
2346                                                   wavefront lane.
2347      then       Work-Item Id Y             1      32 bit work item id in Y
2348                 (enable_vgpr_workitem_id          dimension of work-group for
2349                 > 0)                              wavefront lane.
2350      then       Work-Item Id Z             1      32 bit work item id in Z
2351                 (enable_vgpr_workitem_id          dimension of work-group for
2352                 > 1)                              wavefront lane.
2353      ========== ========================== ====== ==============================
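The dense numbering of enabled VGPRs can be illustrated with a short sketch. It assumes ``enable_vgpr_workitem_id`` takes the values 0, 1 or 2 as implied by the table above; the function name and the returned dictionary are hypothetical and exist only for illustration.

```python
WORK_ITEM_ID_NAMES = ("Work-Item Id X", "Work-Item Id Y", "Work-Item Id Z")


def vgpr_assignment(enable_vgpr_workitem_id: int) -> dict:
    """Map each initialized work-item id to its VGPR, numbered densely from VGPR0."""
    if enable_vgpr_workitem_id not in (0, 1, 2):
        # Assumption: other encodings are not described by the table.
        raise ValueError("enable_vgpr_workitem_id must be 0, 1 or 2")
    # X is always initialized; Y needs a value > 0; Z needs a value > 1.
    enabled = WORK_ITEM_ID_NAMES[: enable_vgpr_workitem_id + 1]
    return {name: f"VGPR{index}" for index, name in enumerate(enabled)}
```

With ``enable_vgpr_workitem_id`` set to 1, only the X and Y ids are initialized and they occupy VGPR0 and VGPR1.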
2355 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
2357 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2358    registers.
2359 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2360    combination including none.
2361 3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why
2362    its value cannot be included with the flat scratch init value which is per queue.
2363 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2364    or (X, Y, Z).
2366 The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64 bit
2367 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2369 The global segment can be accessed either using buffer instructions (GFX6 which
2370 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2371 instructions (GFX9).
2373 If buffer operations are used then the compiler can generate a V# with the
2374 following properties:
2376 * base address of 0
2377 * no swizzle
2378 * ATC: 1 if IOMMU present (such as APU)
2379 * ptr64: 1
2380 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2381   APU and NC for dGPU).
2383 .. _amdgpu-amdhsa-kernel-prolog:
2385 Kernel Prolog
2386 ~~~~~~~~~~~~~
2388 .. _amdgpu-amdhsa-m0:
2390 M0
2391 ++
2393 GFX6-GFX8
2394   The M0 register must be initialized with a value of at least the total LDS
2395   size if the kernel may access LDS via DS or flat operations. The total LDS
2396   size is available in the dispatch packet. It is also possible to use the
2397   maximum possible LDS size for the given target (0x7FFF for GFX6 and 0xFFFF
2398   for GFX7-GFX8).
2399 GFX9
2400   The M0 register is not used for range checking LDS accesses and so does not
2401   need to be initialized in the prolog.
2403 .. _amdgpu-amdhsa-flat-scratch:
2405 Flat Scratch
2406 ++++++++++++
2408 If the kernel may use flat operations to access scratch memory, the prolog code
2409 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2410 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
2411 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2413 GFX6
2414   Flat scratch is not supported.
2416 GFX7-GFX8
2417   1. The low word of Flat Scratch Init is a 32 bit byte offset from
2418      ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2419      being managed by SPI for the queue executing the kernel dispatch. This is
2420      the same value used in the Scratch Segment Buffer V# base address. The
2421      prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
2422      scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2423      FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right shifted
2424      by 8 before moving into FLAT_SCRATCH_HI.
2425   2. The second word of Flat Scratch Init is the 32 bit byte size of a single
2426      work-item's scratch memory usage. This is directly loaded from the kernel
2427      dispatch packet Private Segment Byte Size and rounded up to a multiple of
2428      DWORD. Having CP load it once avoids loading it at the beginning of every
2429      wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2430      SIZE.
2432 GFX9
2433   The Flat Scratch Init is the 64 bit address of the base of scratch backing
2434   memory being managed by SPI for the queue executing the kernel dispatch. The
2435   prolog must add the value of Scratch Wavefront Offset and move the result to the FLAT_SCRATCH
2436   pair for use as the flat scratch base in flat memory instructions.
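The prolog arithmetic can be modelled as follows. This is only a sketch of the calculation, not the machine code the backend emits: the function names are hypothetical, register values are modelled as Python integers wrapped to the register width, and the GFX7-GFX8 register assignment follows the Flat Scratch Init entry of the SGPR set up table (shifted offset to FLAT_SCRATCH_HI, size to FLAT_SCRATCH_LO).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_gfx7_gfx8(init_offset: int, init_size: int, wavefront_offset: int) -> dict:
    """Model the GFX7-GFX8 prolog: (offset + wavefront offset) >> 8, plus the size."""
    # Per wavefront byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID (32 bit wrap).
    byte_offset = (init_offset + wavefront_offset) & MASK32
    return {
        # The hardware interprets this value in units of 256 bytes.
        "FLAT_SCRATCH_HI": byte_offset >> 8,
        # Per work-item scratch size, used as FLAT SCRATCH SIZE.
        "FLAT_SCRATCH_LO": init_size & MASK32,
    }


def flat_scratch_gfx9(init_base: int, wavefront_offset: int) -> int:
    """Model the GFX9 prolog: 64 bit base plus the wavefront offset."""
    return (init_base + wavefront_offset) & MASK64
```

For example, a queue offset of ``0x1000`` and a wavefront offset of ``0x300`` give a byte offset of ``0x1300``, which is stored as ``0x13`` after the right shift by 8.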
2438 .. _amdgpu-amdhsa-memory-model:
2440 Memory Model
2441 ~~~~~~~~~~~~
2443 This section describes the mapping of the LLVM memory model onto AMDGPU machine code
2444 (see :ref:`memmodel`). *The implementation is WIP.*
2446 .. TODO
2447    Update when implementation complete.
2449 The AMDGPU backend supports the memory synchronization scopes specified in
2450 :ref:`amdgpu-memory-scopes`.
2452 The code sequences used to implement the memory model are defined in table
2453 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2455 The sequences specify the order of instructions that a single thread must
2456 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2457 to other memory instructions executed by the same thread. This allows them to be
2458 moved earlier or later which can allow them to be combined with other instances
2459 of the same instruction, or hoisted/sunk out of loops to improve
2460 performance. Only the instructions related to the memory model are given;
2461 additional ``s_waitcnt`` instructions are required to ensure registers are
2462 defined before being used. These may be able to be combined with the memory
2463 model ``s_waitcnt`` instructions as described above.
2465 The AMDGPU backend supports the following memory models:
2467   HSA Memory Model [HSA]_
2468     The HSA memory model uses a single happens-before relation for all address
2469     spaces (see :ref:`amdgpu-address-spaces`).
2470   OpenCL Memory Model [OpenCL]_
2471     The OpenCL memory model has separate happens-before relations for the
2472     global and local address spaces. Only a fence specifying both global and
2473     local address space, and seq_cst instructions join the relationships. Since
2474     the LLVM ``fence`` instruction does not allow an address space to be
2475     specified the OpenCL fence has to conservatively assume both local and
2476     global address space was specified. However, optimizations can often be
2477     done to eliminate the additional ``s_waitcnt`` instructions when there are
2478     no intervening memory instructions which access the corresponding address
2479     space. The code sequences in the table indicate what can be omitted for the
2480     OpenCL memory model. The target triple environment is used to determine if the
2481     source language is OpenCL (see :ref:`amdgpu-opencl`).
2483 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2484 operations.
2486 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2487 termed vector memory operations.
2489 For GFX6-GFX9:
2491 * Each agent has multiple compute units (CU).
2492 * Each CU has multiple SIMDs that execute wavefronts.
2493 * The wavefronts for a single work-group are executed in the same CU but may be
2494   executed by different SIMDs.
2495 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2496   executing on it.
2497 * All LDS operations of a CU are performed as wavefront wide operations in a
2498   global order and involve no caching. Completion is reported to a wavefront in
2499   execution order.
2500 * The LDS memory has multiple request queues shared by the SIMDs of a
2501   CU. Therefore, the LDS operations performed by different wavefronts of a work-group
2502   can be reordered relative to each other, which can result in reordering the
2503   visibility of vector memory operations with respect to LDS operations of other
2504   wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2505   ensure synchronization between LDS operations and vector memory operations
2506   between wavefronts of a work-group, but not between operations performed by the
2507   same wavefront.
2508 * The vector memory operations are performed as wavefront wide operations and
2509   completion is reported to a wavefront in execution order. The exception is
2510   that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2511   vector memory order if they access LDS memory, and out of LDS operation order
2512   if they access global memory.
2513 * The vector memory operations access a single vector L1 cache shared by all
2514   SIMDs of a CU. Therefore, no special action is required for coherence between the
2515   lanes of a single wavefront, or for coherence between wavefronts in the same
2516   work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
2517   executing in different work-groups as they may be executing on different CUs.
2518 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2519   on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2520   scalar operations are used in a restricted way so do not impact the memory
2521   model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2522 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2523   the same agent.
2524 * The L2 cache has independent channels to service disjoint ranges of virtual
2525   addresses.
2526 * Each CU has a separate request queue per channel. Therefore, the vector and
2527   scalar memory operations performed by wavefronts executing in different work-groups
2528   (which may be executing on different CUs) of an agent can be reordered
2529   relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2530   synchronization between vector memory operations of different CUs. It ensures a
2531   previous vector memory operation has completed before executing a subsequent
2532   vector memory or LDS operation and so can be used to meet the requirements of
2533   acquire and release.
2534 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2535   of virtual addresses can be set up to bypass it to ensure system coherence.
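As an illustration of how these hardware properties translate into code sequences, the sketch below mirrors the ``load atomic`` acquire rows for the global address space in the code sequence table later in this section. The function name is hypothetical and the instruction strings are copied from that table; it is a lookup for illustration, not the backend's implementation.

```python
def acquire_load_global_sequence(scope: str) -> list:
    """Return the instruction sequence for a global 'load atomic acquire'."""
    if scope in ("singlethread", "wavefront", "workgroup"):
        # All wavefronts of a work-group share one CU and its vector L1
        # cache, so no wait or invalidate is needed.
        return ["buffer/global/flat_load"]
    if scope in ("agent", "system"):
        return [
            "buffer/global/flat_load glc=1",
            # The load must complete before the cache is invalidated.
            "s_waitcnt vmcnt(0)",
            # Ensures following loads will not see stale global data.
            "buffer_wbinvl1_vol",
        ]
    raise ValueError(f"unknown synchronization scope: {scope}")
```

The wider scopes need the extra instructions because the wavefronts involved may be executing on different CUs, each with its own vector L1 cache.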
2537 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2538 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2539 memory, atomic memory orderings are not meaningful and all accesses are treated
2540 as non-atomic.
2542 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2543 scalar memory instructions). Since the constant address space contents do not
2544 change during the execution of a kernel dispatch it is not legal to perform
2545 stores, and atomic memory orderings are not meaningful and all accesses are
2546 treated as non-atomic.
2548 A memory synchronization scope wider than work-group is not meaningful for the
2549 group (LDS) address space and is treated as work-group.
2551 The memory model does not support the region address space which is treated as
2552 non-atomic.
2554 Acquire memory ordering is not meaningful on store atomic instructions and is
2555 treated as non-atomic.
2557 Release memory ordering is not meaningful on load atomic instructions and is
2558 treated as non-atomic.
2560 Acquire-release memory ordering is not meaningful on load or store atomic
2561 instructions and is treated as acquire and release respectively.
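The normalization rules above for orderings and scopes can be summarized in a small sketch. The function names and the string spellings of instructions, orderings, scopes and address spaces are assumptions chosen to match the code sequence table; ``local`` is used for the group (LDS) address space.

```python
# Synchronization scopes from narrowest to widest.
SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


def effective_ordering(instr: str, ordering: str) -> str:
    """Apply the rules for orderings that are not meaningful on an instruction."""
    if instr == "store atomic" and ordering == "acquire":
        return "non-atomic"
    if instr == "load atomic" and ordering == "release":
        return "non-atomic"
    if ordering == "acq_rel":
        if instr == "load atomic":
            return "acquire"
        if instr == "store atomic":
            return "release"
    return ordering


def effective_scope(scope: str, address_space: str) -> str:
    """Clamp scopes wider than work-group for the group (LDS) address space."""
    if address_space == "local" and SCOPES.index(scope) > SCOPES.index("workgroup"):
        return "workgroup"
    return scope
```

For example, an acquire-release ``load atomic`` is handled as an acquire load, and an agent scope access to LDS is handled as work-group scope.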
2563 The AMDGPU backend only uses scalar memory operations to access memory that is
2564 proven to not change during the execution of the kernel dispatch. This includes
2565 constant address space and global address space for program scope const
2566 variables. Therefore the kernel machine code does not have to maintain the
2567 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2568 and vector L1 caches are invalidated between kernel dispatches by CP since
2569 constant address space data may change between kernel dispatch executions. See
2570 :ref:`amdgpu-amdhsa-memory-spaces`.
2572 The one exception is if scalar writes are used to spill SGPR registers. In this
2573 case the AMDGPU backend ensures the memory location used to spill is never
2574 accessed by vector memory operations at the same time. If scalar writes are used
2575 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2576 return since the locations may be used for vector memory instructions by a
2577 future wavefront that uses the same scratch area, or a function call that creates a
2578 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2579 as all scalar writes are write-before-read in the same thread.
2581 Scratch backing memory (which is used for the private address space)
2582 is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
2583 address space is only accessed by a single thread, and is always
2584 write-before-read, there is never a need to invalidate these entries from the L1
2585 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2586 volatile cache lines.
2588 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2589 to invalidate the L2 cache. This also causes it to be treated as
2590 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2591 (cache coherent) and so the L2 cache will be coherent with the CPU and other
2592 agents.
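The Non-Atomic rows of the table that follows select the ``glc`` and ``slc`` bits from the ``volatile`` and ``nontemporal`` properties of the access. A minimal sketch of that selection for the global, generic, private and constant address spaces is given here; the function name is hypothetical, and bits the table does not list are modelled as 0, which is an assumption.

```python
def non_atomic_cache_bits(op: str, volatile: bool = False, nontemporal: bool = False) -> dict:
    """Select glc/slc for a non-atomic buffer/global/flat load or store."""
    if op not in ("load", "store"):
        raise ValueError("op must be 'load' or 'store'")
    if nontemporal:
        # Nontemporal loads and stores both use glc=1 slc=1.
        return {"glc": 1, "slc": 1}
    if op == "load" and volatile:
        # Only loads distinguish volatile in the table.
        return {"glc": 1, "slc": 0}
    return {"glc": 0, "slc": 0}
```

Local address space accesses use ``ds_load`` and ``ds_store`` and are not covered by this sketch.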
2594   .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2595      :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2597      ============ ============ ============== ========== ===============================
2598      LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
2599                   Ordering     Sync Scope     Address
2600                                               Space
2601      ============ ============ ============== ========== ===============================
2602      **Non-Atomic**
2603      -----------------------------------------------------------------------------------
2604      load         *none*       *none*         - global   - !volatile & !nontemporal
2605                                               - generic
2606                                               - private    1. buffer/global/flat_load
2607                                               - constant
2608                                                          - volatile & !nontemporal
2610                                                            1. buffer/global/flat_load
2611                                                               glc=1
2613                                                          - nontemporal
2615                                                            1. buffer/global/flat_load
2616                                                               glc=1 slc=1
2618      load         *none*       *none*         - local    1. ds_load
2619      store        *none*       *none*         - global   - !nontemporal
2620                                               - generic
2621                                               - private    1. buffer/global/flat_store
2622                                               - constant
2623                                                          - nontemporal
2625                                                            1. buffer/global/flat_store
2626                                                               glc=1 slc=1
2628      store        *none*       *none*         - local    1. ds_store
2629      **Unordered Atomic**
2630      -----------------------------------------------------------------------------------
2631      load atomic  unordered    *any*          *any*      *Same as non-atomic*.
2632      store atomic unordered    *any*          *any*      *Same as non-atomic*.
2633      atomicrmw    unordered    *any*          *any*      *Same as monotonic
2634                                                          atomic*.
2635      **Monotonic Atomic**
2636      -----------------------------------------------------------------------------------
2637      load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
2638                                - wavefront    - generic
2639                                - workgroup
2640      load atomic  monotonic    - singlethread - local    1. ds_load
2641                                - wavefront
2642                                - workgroup
2643      load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
2644                                - system       - generic     glc=1
2645      store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
2646                                - wavefront    - generic
2647                                - workgroup
2648                                - agent
2649                                - system
2650      store atomic monotonic    - singlethread - local    1. ds_store
2651                                - wavefront
2652                                - workgroup
2653      atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
2654                                - wavefront    - generic
2655                                - workgroup
2656                                - agent
2657                                - system
2658      atomicrmw    monotonic    - singlethread - local    1. ds_atomic
2659                                - wavefront
2660                                - workgroup
2661      **Acquire Atomic**
2662      -----------------------------------------------------------------------------------
2663      load atomic  acquire      - singlethread - global   1. buffer/global/ds/flat_load
2664                                - wavefront    - local
2665                                               - generic
2666      load atomic  acquire      - workgroup    - global   1. buffer/global/flat_load
2667      load atomic  acquire      - workgroup    - local    1. ds_load
2668                                                          2. s_waitcnt lgkmcnt(0)
2670                                                            - If OpenCL, omit.
2671                                                            - Must happen before
2672                                                              any following
2673                                                              global/generic
2674                                                              load/load
2675                                                              atomic/store/store
2676                                                              atomic/atomicrmw.
2677                                                            - Ensures any
2678                                                              following global
2679                                                              data read is no
2680                                                              older than the load
2681                                                              atomic value being
2682                                                              acquired.
2683      load atomic  acquire      - workgroup    - generic  1. flat_load
2684                                                          2. s_waitcnt lgkmcnt(0)
2686                                                            - If OpenCL, omit.
2687                                                            - Must happen before
2688                                                              any following
2689                                                              global/generic
2690                                                              load/load
2691                                                              atomic/store/store
2692                                                              atomic/atomicrmw.
2693                                                            - Ensures any
2694                                                              following global
2695                                                              data read is no
2696                                                              older than the load
2697                                                              atomic value being
2698                                                              acquired.
2699      load atomic  acquire      - agent        - global   1. buffer/global/flat_load
2700                                - system                     glc=1
2701                                                          2. s_waitcnt vmcnt(0)
2703                                                            - Must happen before
2704                                                              following
2705                                                              buffer_wbinvl1_vol.
2706                                                            - Ensures the load
2707                                                              has completed
2708                                                              before invalidating
2709                                                              the cache.
2711                                                          3. buffer_wbinvl1_vol
2713                                                            - Must happen before
2714                                                              any following
2715                                                              global/generic
2716                                                              load/load
2717                                                              atomic/atomicrmw.
2718                                                            - Ensures that
2719                                                              following
2720                                                              loads will not see
2721                                                              stale global data.
2723      load atomic  acquire      - agent        - generic  1. flat_load glc=1
2724                                - system                  2. s_waitcnt vmcnt(0) &
2725                                                             lgkmcnt(0)
2727                                                            - If OpenCL, omit
2728                                                              lgkmcnt(0).
2729                                                            - Must happen before
2730                                                              following
2731                                                              buffer_wbinvl1_vol.
2732                                                            - Ensures the flat_load
2733                                                              has completed
2734                                                              before invalidating
2735                                                              the cache.
2737                                                          3. buffer_wbinvl1_vol
2739                                                            - Must happen before
2740                                                              any following
2741                                                              global/generic
2742                                                              load/load
2743                                                              atomic/atomicrmw.
2744                                                            - Ensures that
2745                                                              following loads
2746                                                              will not see stale
2747                                                              global data.
2749      atomicrmw    acquire      - singlethread - global   1. buffer/global/ds/flat_atomic
2750                                - wavefront    - local
2751                                               - generic
2752      atomicrmw    acquire      - workgroup    - global   1. buffer/global/flat_atomic
2753      atomicrmw    acquire      - workgroup    - local    1. ds_atomic
2754                                                          2. waitcnt lgkmcnt(0)
2756                                                            - If OpenCL, omit.
2757                                                            - Must happen before
2758                                                              any following
2759                                                              global/generic
2760                                                              load/load
2761                                                              atomic/store/store
2762                                                              atomic/atomicrmw.
2763                                                            - Ensures any
2764                                                              following global
2765                                                              data read is no
2766                                                              older than the
2767                                                              atomicrmw value
2768                                                              being acquired.
2770      atomicrmw    acquire      - workgroup    - generic  1. flat_atomic
2771                                                          2. waitcnt lgkmcnt(0)
2773                                                            - If OpenCL, omit.
2774                                                            - Must happen before
2775                                                              any following
2776                                                              global/generic
2777                                                              load/load
2778                                                              atomic/store/store
2779                                                              atomic/atomicrmw.
2780                                                            - Ensures any
2781                                                              following global
2782                                                              data read is no
2783                                                              older than the
2784                                                              atomicrmw value
2785                                                              being acquired.
2787      atomicrmw    acquire      - agent        - global   1. buffer/global/flat_atomic
2788                                - system                  2. s_waitcnt vmcnt(0)
2790                                                            - Must happen before
2791                                                              following
2792                                                              buffer_wbinvl1_vol.
2793                                                            - Ensures the
2794                                                              atomicrmw has
2795                                                              completed before
2796                                                              invalidating the
2797                                                              cache.
2799                                                          3. buffer_wbinvl1_vol
2801                                                            - Must happen before
2802                                                              any following
2803                                                              global/generic
2804                                                              load/load
2805                                                              atomic/atomicrmw.
2806                                                            - Ensures that
2807                                                              following loads
2808                                                              will not see stale
2809                                                              global data.
2811      atomicrmw    acquire      - agent        - generic  1. flat_atomic
2812                                - system                  2. s_waitcnt vmcnt(0) &
2813                                                             lgkmcnt(0)
2815                                                            - If OpenCL, omit
2816                                                              lgkmcnt(0).
2817                                                            - Must happen before
2818                                                              following
2819                                                              buffer_wbinvl1_vol.
2820                                                            - Ensures the
2821                                                              atomicrmw has
2822                                                              completed before
2823                                                              invalidating the
2824                                                              cache.
2826                                                          3. buffer_wbinvl1_vol
2828                                                            - Must happen before
2829                                                              any following
2830                                                              global/generic
2831                                                              load/load
2832                                                              atomic/atomicrmw.
2833                                                            - Ensures that
2834                                                              following loads
2835                                                              will not see stale
2836                                                              global data.
2838      fence        acquire      - singlethread *none*     *none*
2839                                - wavefront
2840      fence        acquire      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
2842                                                            - If OpenCL and
2843                                                              address space is
2844                                                              not generic, omit.
2845                                                            - However, since LLVM
2846                                                              currently has no
2847                                                              address space on
2848                                                              the fence, it must
2849                                                              conservatively
2850                                                              always be generated. If
2851                                                              fence had an
2852                                                              address space then
2853                                                              set to address
2854                                                              space of OpenCL
2855                                                              fence flag, or to
2856                                                              generic if both
2857                                                              local and global
2858                                                              flags are
2859                                                              specified.
2860                                                            - Must happen after
2861                                                              any preceding
2862                                                              local/generic load
2863                                                              atomic/atomicrmw
2864                                                              with an equal or
2865                                                              wider sync scope
2866                                                              and memory ordering
2867                                                              stronger than
2868                                                              unordered (this is
2869                                                              termed the
2870                                                              fence-paired-atomic).
2871                                                            - Must happen before
2872                                                              any following
2873                                                              global/generic
2874                                                              load/load
2875                                                              atomic/store/store
2876                                                              atomic/atomicrmw.
2877                                                            - Ensures any
2878                                                              following global
2879                                                              data read is no
2880                                                              older than the
2881                                                              value read by the
2882                                                              fence-paired-atomic.
2884      fence        acquire      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
2885                                - system                     vmcnt(0)
2887                                                            - If OpenCL and
2888                                                              address space is
2889                                                              not generic, omit
2890                                                              lgkmcnt(0).
2891                                                            - However, since LLVM
2892                                                              currently has no
2893                                                              address space on
2894                                                              the fence, it must
2895                                                              conservatively
2896                                                              always be generated
2897                                                              (see comment for
2898                                                              previous fence).
2899                                                            - Could be split into
2900                                                              separate s_waitcnt
2901                                                              vmcnt(0) and
2902                                                              s_waitcnt
2903                                                              lgkmcnt(0) to allow
2904                                                              them to be
2905                                                              independently moved
2906                                                              according to the
2907                                                              following rules.
2908                                                            - s_waitcnt vmcnt(0)
2909                                                              must happen after
2910                                                              any preceding
2911                                                              global/generic load
2912                                                              atomic/atomicrmw
2913                                                              with an equal or
2914                                                              wider sync scope
2915                                                              and memory ordering
2916                                                              stronger than
2917                                                              unordered (this is
2918                                                              termed the
2919                                                              fence-paired-atomic).
2920                                                            - s_waitcnt lgkmcnt(0)
2921                                                              must happen after
2922                                                              any preceding
2923                                                              local/generic load
2924                                                              atomic/atomicrmw
2925                                                              with an equal or
2926                                                              wider sync scope
2927                                                              and memory ordering
2928                                                              stronger than
2929                                                              unordered (this is
2930                                                              termed the
2931                                                              fence-paired-atomic).
2932                                                            - Must happen before
2933                                                              the following
2934                                                              buffer_wbinvl1_vol.
2935                                                            - Ensures that the
2936                                                              fence-paired-atomic
2937                                                              has completed
2938                                                              before invalidating
2939                                                              the
2940                                                              cache. Therefore
2941                                                              any following
2942                                                              locations read must
2943                                                              be no older than
2944                                                              the value read by
2945                                                              the
2946                                                              fence-paired-atomic.
2948                                                          2. buffer_wbinvl1_vol
2950                                                            - Must happen before any
2951                                                              following global/generic
2952                                                              load/load
2953                                                              atomic/store/store
2954                                                              atomic/atomicrmw.
2955                                                            - Ensures that
2956                                                              following loads
2957                                                              will not see stale
2958                                                              global data.
2960      **Release Atomic**
2961      -----------------------------------------------------------------------------------
2962      store atomic release      - singlethread - global   1. buffer/global/ds/flat_store
2963                                - wavefront    - local
2964                                               - generic
2965      store atomic release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
2967                                                            - If OpenCL, omit.
2968                                                            - Must happen after
2969                                                              any preceding
2970                                                              local/generic
2971                                                              load/store/load
2972                                                              atomic/store
2973                                                              atomic/atomicrmw.
2974                                                            - Must happen before
2975                                                              the following
2976                                                              store.
2977                                                            - Ensures that all
2978                                                              memory operations
2979                                                              to local have
2980                                                              completed before
2981                                                              performing the
2982                                                              store that is being
2983                                                              released.
2985                                                          2. buffer/global/flat_store
2986      store atomic release      - workgroup    - local    1. ds_store
2987      store atomic release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
2989                                                            - If OpenCL, omit.
2990                                                            - Must happen after
2991                                                              any preceding
2992                                                              local/generic
2993                                                              load/store/load
2994                                                              atomic/store
2995                                                              atomic/atomicrmw.
2996                                                            - Must happen before
2997                                                              the following
2998                                                              store.
2999                                                            - Ensures that all
3000                                                              memory operations
3001                                                              to local have
3002                                                              completed before
3003                                                              performing the
3004                                                              store that is being
3005                                                              released.
3007                                                          2. flat_store
3008      store atomic release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3009                                - system       - generic     vmcnt(0)
3011                                                            - If OpenCL, omit
3012                                                              lgkmcnt(0).
3013                                                            - Could be split into
3014                                                              separate s_waitcnt
3015                                                              vmcnt(0) and
3016                                                              s_waitcnt
3017                                                              lgkmcnt(0) to allow
3018                                                              them to be
3019                                                              independently moved
3020                                                              according to the
3021                                                              following rules.
3022                                                            - s_waitcnt vmcnt(0)
3023                                                              must happen after
3024                                                              any preceding
3025                                                              global/generic
3026                                                              load/store/load
3027                                                              atomic/store
3028                                                              atomic/atomicrmw.
3029                                                            - s_waitcnt lgkmcnt(0)
3030                                                              must happen after
3031                                                              any preceding
3032                                                              local/generic
3033                                                              load/store/load
3034                                                              atomic/store
3035                                                              atomic/atomicrmw.
3036                                                            - Must happen before
3037                                                              the following
3038                                                              store.
3039                                                            - Ensures that all
3040                                                              memory operations
3041                                                              to global and local
3042                                                              have completed
3043                                                              before performing
3044                                                              the store that is
3045                                                              being released.
3047                                                          2. buffer/global/ds/flat_store
3048      atomicrmw    release      - singlethread - global   1. buffer/global/ds/flat_atomic
3049                                - wavefront    - local
3050                                               - generic
3051      atomicrmw    release      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3053                                                            - If OpenCL, omit.
3054                                                            - Must happen after
3055                                                              any preceding
3056                                                              local/generic
3057                                                              load/store/load
3058                                                              atomic/store
3059                                                              atomic/atomicrmw.
3060                                                            - Must happen before
3061                                                              the following
3062                                                              atomicrmw.
3063                                                            - Ensures that all
3064                                                              memory operations
3065                                                              to local have
3066                                                              completed before
3067                                                              performing the
3068                                                              atomicrmw that is
3069                                                              being released.
3071                                                          2. buffer/global/flat_atomic
3072      atomicrmw    release      - workgroup    - local    1. ds_atomic
3073      atomicrmw    release      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3075                                                            - If OpenCL, omit.
3076                                                            - Must happen after
3077                                                              any preceding
3078                                                              local/generic
3079                                                              load/store/load
3080                                                              atomic/store
3081                                                              atomic/atomicrmw.
3082                                                            - Must happen before
3083                                                              the following
3084                                                              atomicrmw.
3085                                                            - Ensures that all
3086                                                              memory operations
3087                                                              to local have
3088                                                              completed before
3089                                                              performing the
3090                                                              atomicrmw that is
3091                                                              being released.
3093                                                          2. flat_atomic
3094      atomicrmw    release      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3095                                - system       - generic     vmcnt(0)
3097                                                            - If OpenCL, omit
3098                                                              lgkmcnt(0).
3099                                                            - Could be split into
3100                                                              separate s_waitcnt
3101                                                              vmcnt(0) and
3102                                                              s_waitcnt
3103                                                              lgkmcnt(0) to allow
3104                                                              them to be
3105                                                              independently moved
3106                                                              according to the
3107                                                              following rules.
3108                                                            - s_waitcnt vmcnt(0)
3109                                                              must happen after
3110                                                              any preceding
3111                                                              global/generic
3112                                                              load/store/load
3113                                                              atomic/store
3114                                                              atomic/atomicrmw.
3115                                                            - s_waitcnt lgkmcnt(0)
3116                                                              must happen after
3117                                                              any preceding
3118                                                              local/generic
3119                                                              load/store/load
3120                                                              atomic/store
3121                                                              atomic/atomicrmw.
3122                                                            - Must happen before
3123                                                              the following
3124                                                              atomicrmw.
3125                                                            - Ensures that all
3126                                                              memory operations
3127                                                              to global and local
3128                                                              have completed
3129                                                              before performing
3130                                                              the atomicrmw that
3131                                                              is being released.
3133                                                          2. buffer/global/ds/flat_atomic
3134      fence        release      - singlethread *none*     *none*
3135                                - wavefront
3136      fence        release      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3138                                                            - If OpenCL and
3139                                                              address space is
3140                                                              not generic, omit.
3141                                                            - However, since LLVM
3142                                                              currently has no
3143                                                              address space on
3144                                                              the fence, the wait
3145                                                              must conservatively
3146                                                              always be generated. If
3147                                                              fence had an
3148                                                              address space then
3149                                                              set to address
3150                                                              space of OpenCL
3151                                                              fence flag, or to
3152                                                              generic if both
3153                                                              local and global
3154                                                              flags are
3155                                                              specified.
3156                                                            - Must happen after
3157                                                              any preceding
3158                                                              local/generic
3159                                                              load/load
3160                                                              atomic/store/store
3161                                                              atomic/atomicrmw.
3162                                                            - Must happen before
3163                                                              any following store
3164                                                              atomic/atomicrmw
3165                                                              with an equal or
3166                                                              wider sync scope
3167                                                              and memory ordering
3168                                                              stronger than
3169                                                              unordered (this is
3170                                                              termed the
3171                                                              fence-paired-atomic).
3172                                                            - Ensures that all
3173                                                              memory operations
3174                                                              to local have
3175                                                              completed before
3176                                                              performing the
3177                                                              following
3178                                                              fence-paired-atomic.
3180      fence        release      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3181                                - system                     vmcnt(0)
3183                                                            - If OpenCL and
3184                                                              address space is
3185                                                              not generic, omit
3186                                                              lgkmcnt(0).
3187                                                            - If OpenCL and
3188                                                              address space is
3189                                                              local, omit
3190                                                              vmcnt(0).
3191                                                            - However, since LLVM
3192                                                              currently has no
3193                                                              address space on
3194                                                              the fence, the wait
3195                                                              must conservatively
3196                                                              always be generated. If
3197                                                              fence had an
3198                                                              address space then
3199                                                              set to address
3200                                                              space of OpenCL
3201                                                              fence flag, or to
3202                                                              generic if both
3203                                                              local and global
3204                                                              flags are
3205                                                              specified.
3206                                                            - Could be split into
3207                                                              separate s_waitcnt
3208                                                              vmcnt(0) and
3209                                                              s_waitcnt
3210                                                              lgkmcnt(0) to allow
3211                                                              them to be
3212                                                              independently moved
3213                                                              according to the
3214                                                              following rules.
3215                                                            - s_waitcnt vmcnt(0)
3216                                                              must happen after
3217                                                              any preceding
3218                                                              global/generic
3219                                                              load/store/load
3220                                                              atomic/store
3221                                                              atomic/atomicrmw.
3222                                                            - s_waitcnt lgkmcnt(0)
3223                                                              must happen after
3224                                                              any preceding
3225                                                              local/generic
3226                                                              load/store/load
3227                                                              atomic/store
3228                                                              atomic/atomicrmw.
3229                                                            - Must happen before
3230                                                              any following store
3231                                                              atomic/atomicrmw
3232                                                              with an equal or
3233                                                              wider sync scope
3234                                                              and memory ordering
3235                                                              stronger than
3236                                                              unordered (this is
3237                                                              termed the
3238                                                              fence-paired-atomic).
3239                                                            - Ensures that all
3240                                                              memory operations
3241                                                              have
3242                                                              completed before
3243                                                              performing the
3244                                                              following
3245                                                              fence-paired-atomic.
3247      **Acquire-Release Atomic**
3248      -----------------------------------------------------------------------------------
3249      atomicrmw    acq_rel      - singlethread - global   1. buffer/global/ds/flat_atomic
3250                                - wavefront    - local
3251                                               - generic
3252      atomicrmw    acq_rel      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3254                                                            - If OpenCL, omit.
3255                                                            - Must happen after
3256                                                              any preceding
3257                                                              local/generic
3258                                                              load/store/load
3259                                                              atomic/store
3260                                                              atomic/atomicrmw.
3261                                                            - Must happen before
3262                                                              the following
3263                                                              atomicrmw.
3264                                                            - Ensures that all
3265                                                              memory operations
3266                                                              to local have
3267                                                              completed before
3268                                                              performing the
3269                                                              atomicrmw that is
3270                                                              being released.
3272                                                          2. buffer/global/flat_atomic
3273      atomicrmw    acq_rel      - workgroup    - local    1. ds_atomic
3274                                                          2. s_waitcnt lgkmcnt(0)
3276                                                            - If OpenCL, omit.
3277                                                            - Must happen before
3278                                                              any following
3279                                                              global/generic
3280                                                              load/load
3281                                                              atomic/store/store
3282                                                              atomic/atomicrmw.
3283                                                            - Ensures any
3284                                                              following global
3285                                                              data read is no
3286                                                              older than the
3287                                                              atomicrmw value being
3288                                                              acquired.
3290      atomicrmw    acq_rel      - workgroup    - generic  1. s_waitcnt lgkmcnt(0)
3292                                                            - If OpenCL, omit.
3293                                                            - Must happen after
3294                                                              any preceding
3295                                                              local/generic
3296                                                              load/store/load
3297                                                              atomic/store
3298                                                              atomic/atomicrmw.
3299                                                            - Must happen before
3300                                                              the following
3301                                                              atomicrmw.
3302                                                            - Ensures that all
3303                                                              memory operations
3304                                                              to local have
3305                                                              completed before
3306                                                              performing the
3307                                                              atomicrmw that is
3308                                                              being released.
3310                                                          2. flat_atomic
3311                                                          3. s_waitcnt lgkmcnt(0)
3313                                                            - If OpenCL, omit.
3314                                                            - Must happen before
3315                                                              any following
3316                                                              global/generic
3317                                                              load/load
3318                                                              atomic/store/store
3319                                                              atomic/atomicrmw.
3320                                                            - Ensures any
3321                                                              following global
3322                                                              data read is no
3323                                                              older than the
3324                                                              atomicrmw value being
3325                                                              acquired.
3327      atomicrmw    acq_rel      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3328                                - system                     vmcnt(0)
3330                                                            - If OpenCL, omit
3331                                                              lgkmcnt(0).
3332                                                            - Could be split into
3333                                                              separate s_waitcnt
3334                                                              vmcnt(0) and
3335                                                              s_waitcnt
3336                                                              lgkmcnt(0) to allow
3337                                                              them to be
3338                                                              independently moved
3339                                                              according to the
3340                                                              following rules.
3341                                                            - s_waitcnt vmcnt(0)
3342                                                              must happen after
3343                                                              any preceding
3344                                                              global/generic
3345                                                              load/store/load
3346                                                              atomic/store
3347                                                              atomic/atomicrmw.
3348                                                            - s_waitcnt lgkmcnt(0)
3349                                                              must happen after
3350                                                              any preceding
3351                                                              local/generic
3352                                                              load/store/load
3353                                                              atomic/store
3354                                                              atomic/atomicrmw.
3355                                                            - Must happen before
3356                                                              the following
3357                                                              atomicrmw.
3358                                                            - Ensures that all
3359                                                              memory operations
3360                                                              to global have
3361                                                              completed before
3362                                                              performing the
3363                                                              atomicrmw that is
3364                                                              being released.
3366                                                          2. buffer/global/flat_atomic
3367                                                          3. s_waitcnt vmcnt(0)
3369                                                            - Must happen before
3370                                                              the following
3371                                                              buffer_wbinvl1_vol.
3372                                                            - Ensures the
3373                                                              atomicrmw has
3374                                                              completed before
3375                                                              invalidating the
3376                                                              cache.
3378                                                          4. buffer_wbinvl1_vol
3380                                                            - Must happen before
3381                                                              any following
3382                                                              global/generic
3383                                                              load/load
3384                                                              atomic/atomicrmw.
3385                                                            - Ensures that
3386                                                              following loads
3387                                                              will not see stale
3388                                                              global data.
3390      atomicrmw    acq_rel      - agent        - generic  1. s_waitcnt lgkmcnt(0) &
3391                                - system                     vmcnt(0)
3393                                                            - If OpenCL, omit
3394                                                              lgkmcnt(0).
3395                                                            - Could be split into
3396                                                              separate s_waitcnt
3397                                                              vmcnt(0) and
3398                                                              s_waitcnt
3399                                                              lgkmcnt(0) to allow
3400                                                              them to be
3401                                                              independently moved
3402                                                              according to the
3403                                                              following rules.
3404                                                            - s_waitcnt vmcnt(0)
3405                                                              must happen after
3406                                                              any preceding
3407                                                              global/generic
3408                                                              load/store/load
3409                                                              atomic/store
3410                                                              atomic/atomicrmw.
3411                                                            - s_waitcnt lgkmcnt(0)
3412                                                              must happen after
3413                                                              any preceding
3414                                                              local/generic
3415                                                              load/store/load
3416                                                              atomic/store
3417                                                              atomic/atomicrmw.
3418                                                            - Must happen before
3419                                                              the following
3420                                                              atomicrmw.
3421                                                            - Ensures that all
3422                                                              memory operations
3423                                                              to global have
3424                                                              completed before
3425                                                              performing the
3426                                                              atomicrmw that is
3427                                                              being released.
3429                                                          2. flat_atomic
3430                                                          3. s_waitcnt vmcnt(0) &
3431                                                             lgkmcnt(0)
3433                                                            - If OpenCL, omit
3434                                                              lgkmcnt(0).
3435                                                            - Must happen before
3436                                                              the following
3437                                                              buffer_wbinvl1_vol.
3438                                                            - Ensures the
3439                                                              atomicrmw has
3440                                                              completed before
3441                                                              invalidating the
3442                                                              cache.
3444                                                          4. buffer_wbinvl1_vol
3446                                                            - Must happen before
3447                                                              any following
3448                                                              global/generic
3449                                                              load/load
3450                                                              atomic/atomicrmw.
3451                                                            - Ensures that
3452                                                              following loads
3453                                                              will not see stale
3454                                                              global data.
3456      fence        acq_rel      - singlethread *none*     *none*
3457                                - wavefront
3458      fence        acq_rel      - workgroup    *none*     1. s_waitcnt lgkmcnt(0)
3460                                                            - If OpenCL and
3461                                                              address space is
3462                                                              not generic, omit.
3463                                                            - However,
3464                                                              since LLVM
3465                                                              currently has no
3466                                                              address space on
3467                                                              the fence, the wait
3468                                                              must conservatively
3469                                                              always be generated
3470                                                              (see comment for
3471                                                              previous fence).
3472                                                            - Must happen after
3473                                                              any preceding
3474                                                              local/generic
3475                                                              load/load
3476                                                              atomic/store/store
3477                                                              atomic/atomicrmw.
3478                                                            - Must happen before
3479                                                              any following
3480                                                              global/generic
3481                                                              load/load
3482                                                              atomic/store/store
3483                                                              atomic/atomicrmw.
3484                                                            - Ensures that all
3485                                                              memory operations
3486                                                              to local have
3487                                                              completed before
3488                                                              performing any
3489                                                              following global
3490                                                              memory operations.
3491                                                            - Ensures that the
3492                                                              preceding
3493                                                              local/generic load
3494                                                              atomic/atomicrmw
3495                                                              with an equal or
3496                                                              wider sync scope
3497                                                              and memory ordering
3498                                                              stronger than
3499                                                              unordered (this is
3500                                                              termed the
3501                                                              acquire-fence-paired-atomic)
3502                                                              has completed
3503                                                              before any following
3504                                                              global memory
3505                                                              operations. This
3506                                                              satisfies the
3507                                                              requirements of
3508                                                              acquire.
3509                                                            - Ensures that all
3510                                                              previous memory
3511                                                              operations have
3512                                                              completed before a
3513                                                              following
3514                                                              local/generic store
3515                                                              atomic/atomicrmw
3516                                                              with an equal or
3517                                                              wider sync scope
3518                                                              and memory ordering
3519                                                              stronger than
3520                                                              unordered (this is
3521                                                              termed the
3522                                                              release-fence-paired-atomic).
3523                                                              This satisfies the
3524                                                              requirements of
3525                                                              release.
3527      fence        acq_rel      - agent        *none*     1. s_waitcnt lgkmcnt(0) &
3528                                - system                     vmcnt(0)
3530                                                            - If OpenCL and
3531                                                              address space is
3532                                                              not generic, omit
3533                                                              lgkmcnt(0).
3534                                                            - However, since LLVM
3535                                                              currently has no
3536                                                              address space on
3537                                                              the fence, the wait
3538                                                              must conservatively
3539                                                              always be generated
3540                                                              (see comment for
3541                                                              previous fence).
3542                                                            - Could be split into
3543                                                              separate s_waitcnt
3544                                                              vmcnt(0) and
3545                                                              s_waitcnt
3546                                                              lgkmcnt(0) to allow
3547                                                              them to be
3548                                                              independently moved
3549                                                              according to the
3550                                                              following rules.
3551                                                            - s_waitcnt vmcnt(0)
3552                                                              must happen after
3553                                                              any preceding
3554                                                              global/generic
3555                                                              load/store/load
3556                                                              atomic/store
3557                                                              atomic/atomicrmw.
3558                                                            - s_waitcnt lgkmcnt(0)
3559                                                              must happen after
3560                                                              any preceding
3561                                                              local/generic
3562                                                              load/store/load
3563                                                              atomic/store
3564                                                              atomic/atomicrmw.
3565                                                            - Must happen before
3566                                                              the following
3567                                                              buffer_wbinvl1_vol.
3568                                                            - Ensures that the
3569                                                              preceding
3570                                                              global/local/generic
3571                                                              load
3572                                                              atomic/atomicrmw
3573                                                              with an equal or
3574                                                              wider sync scope
3575                                                              and memory ordering
3576                                                              stronger than
3577                                                              unordered (this is
3578                                                              termed the
3579                                                              acquire-fence-paired-atomic
3580                                                              ) has completed
3581                                                              before invalidating
3582                                                              the cache. This
3583                                                              satisfies the
3584                                                              requirements of
3585                                                              acquire.
3586                                                            - Ensures that all
3587                                                              previous memory
3588                                                              operations have
3589                                                              completed before a
3590                                                              following
3591                                                              global/local/generic
3592                                                              store
3593                                                              atomic/atomicrmw
3594                                                              with an equal or
3595                                                              wider sync scope
3596                                                              and memory ordering
3597                                                              stronger than
3598                                                              unordered (this is
3599                                                              termed the
3600                                                              release-fence-paired-atomic
3601                                                              ). This satisfies the
3602                                                              requirements of
3603                                                              release.
3605                                                          2. buffer_wbinvl1_vol
3607                                                            - Must happen before
3608                                                              any following
3609                                                              global/generic
3610                                                              load/load
3611                                                              atomic/store/store
3612                                                              atomic/atomicrmw.
3613                                                            - Ensures that
3614                                                              following loads
3615                                                              will not see stale
3616                                                              global data. This
3617                                                              satisfies the
3618                                                              requirements of
3619                                                              acquire.
3621      **Sequentially Consistent Atomic**
3622      -----------------------------------------------------------------------------------
3623      load atomic  seq_cst      - singlethread - global   *Same as corresponding
3624                                - wavefront    - local    load atomic acquire,
3625                                               - generic  except must generate
3626                                                          all instructions even
3627                                                          for OpenCL.*
3628      load atomic  seq_cst      - workgroup    - global   1. s_waitcnt lgkmcnt(0)
3629                                               - generic
3630                                                            - Must
3631                                                              happen after
3632                                                              preceding
3633                                                              global/generic load
3634                                                              atomic/store
3635                                                              atomic/atomicrmw
3636                                                              with memory
3637                                                              ordering of seq_cst
3638                                                              and with equal or
3639                                                              wider sync scope.
3640                                                              (Note that seq_cst
3641                                                              fences have their
3642                                                              own s_waitcnt
3643                                                              lgkmcnt(0) and so do
3644                                                              not need to be
3645                                                              considered.)
3646                                                            - Ensures any
3647                                                              preceding
3648                                                              sequentially
3649                                                              consistent local
3650                                                              memory instructions
3651                                                              have completed
3652                                                              before executing
3653                                                              this sequentially
3654                                                              consistent
3655                                                              instruction. This
3656                                                              prevents reordering
3657                                                              a seq_cst store
3658                                                              followed by a
3659                                                              seq_cst load. (Note
3660                                                              that seq_cst is
3661                                                              stronger than
3662                                                              acquire/release as
3663                                                              the reordering of
3664                                                              load acquire
3665                                                              followed by a store
3666                                                              release is
3667                                                              prevented by the
3668                                                              waitcnt of
3669                                                              the release, but
3670                                                              there is nothing
3671                                                              preventing a store
3672                                                              release followed by
3673                                                              load acquire from
3674                                                              completing out of
3675                                                              order.)
3677                                                          2. *Following
3678                                                             instructions same as
3679                                                             corresponding load
3680                                                             atomic acquire,
3681                                                             except must generate
3682                                                             all instructions even
3683                                                             for OpenCL.*
3684      load atomic  seq_cst      - workgroup    - local    *Same as corresponding
3685                                                          load atomic acquire,
3686                                                          except must generate
3687                                                          all instructions even
3688                                                          for OpenCL.*
3689      load atomic  seq_cst      - agent        - global   1. s_waitcnt lgkmcnt(0) &
3690                                - system       - generic     vmcnt(0)
3692                                                            - Could be split into
3693                                                              separate s_waitcnt
3694                                                              vmcnt(0)
3695                                                              and s_waitcnt
3696                                                              lgkmcnt(0) to allow
3697                                                              them to be
3698                                                              independently moved
3699                                                              according to the
3700                                                              following rules.
3701                                                            - waitcnt lgkmcnt(0)
3702                                                              must happen after
3703                                                              preceding
3704                                                              global/generic load
3705                                                              atomic/store
3706                                                              atomic/atomicrmw
3707                                                              with memory
3708                                                              ordering of seq_cst
3709                                                              and with equal or
3710                                                              wider sync scope.
3711                                                              (Note that seq_cst
3712                                                              fences have their
3713                                                              own s_waitcnt
3714                                                              lgkmcnt(0) and so do
3715                                                              not need to be
3716                                                              considered.)
3717                                                            - s_waitcnt vmcnt(0)
3718                                                              must happen after
3719                                                              preceding
3720                                                              global/generic load
3721                                                              atomic/store
3722                                                              atomic/atomicrmw
3723                                                              with memory
3724                                                              ordering of seq_cst
3725                                                              and with equal or
3726                                                              wider sync scope.
3727                                                              (Note that seq_cst
3728                                                              fences have their
3729                                                              own s_waitcnt
3730                                                              vmcnt(0) and so do
3731                                                              not need to be
3732                                                              considered.)
3733                                                            - Ensures any
3734                                                              preceding
3735                                                              sequentially
3736                                                              consistent global
3737                                                              memory instructions
3738                                                              have completed
3739                                                              before executing
3740                                                              this sequentially
3741                                                              consistent
3742                                                              instruction. This
3743                                                              prevents reordering
3744                                                              a seq_cst store
3745                                                              followed by a
3746                                                              seq_cst load. (Note
3747                                                              that seq_cst is
3748                                                              stronger than
3749                                                              acquire/release as
3750                                                              the reordering of
3751                                                              load acquire
3752                                                              followed by a store
3753                                                              release is
3754                                                              prevented by the
3755                                                              waitcnt of
3756                                                              the release, but
3757                                                              there is nothing
3758                                                              preventing a store
3759                                                              release followed by
3760                                                              load acquire from
3761                                                              completing out of
3762                                                              order.)
3764                                                          2. *Following
3765                                                             instructions same as
3766                                                             corresponding load
3767                                                             atomic acquire,
3768                                                             except must generate
3769                                                             all instructions even
3770                                                             for OpenCL.*
3771      store atomic seq_cst      - singlethread - global   *Same as corresponding
3772                                - wavefront    - local    store atomic release,
3773                                - workgroup    - generic  except must generate
3774                                                          all instructions even
3775                                                          for OpenCL.*
3776      store atomic seq_cst      - agent        - global   *Same as corresponding
3777                                - system       - generic  store atomic release,
3778                                                          except must generate
3779                                                          all instructions even
3780                                                          for OpenCL.*
3781      atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
3782                                - wavefront    - local    atomicrmw acq_rel,
3783                                - workgroup    - generic  except must generate
3784                                                          all instructions even
3785                                                          for OpenCL.*
3786      atomicrmw    seq_cst      - agent        - global   *Same as corresponding
3787                                - system       - generic  atomicrmw acq_rel,
3788                                                          except must generate
3789                                                          all instructions even
3790                                                          for OpenCL.*
3791      fence        seq_cst      - singlethread *none*     *Same as corresponding
3792                                - wavefront               fence acq_rel,
3793                                - workgroup               except must generate
3794                                - agent                   all instructions even
3795                                - system                  for OpenCL.*
3796      ============ ============ ============== ========== ===============================
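As a rough illustration of the ``load atomic seq_cst`` rows above, the following Python sketch returns only the leading wait that the table places in front of the corresponding ``load atomic acquire`` sequence. The function name is hypothetical, the strings are written in the table's notation, and combinations the table does not list are rejected rather than guessed.

```python
# Sketch only: leading waits for a seq_cst load atomic on GFX6-GFX9, as read
# from the table above. seq_cst_load_leading_waits is a hypothetical name.

def seq_cst_load_leading_waits(sync_scope, address_space):
    """Return the waits emitted before the load atomic acquire sequence."""
    if address_space not in ("global", "local", "generic"):
        raise ValueError(f"unknown address space: {address_space}")
    if sync_scope in ("singlethread", "wavefront"):
        # Same as the corresponding load atomic acquire: nothing extra.
        return []
    if sync_scope == "workgroup":
        if address_space == "local":
            return []
        return ["s_waitcnt lgkmcnt(0)"]
    if sync_scope in ("agent", "system") and address_space != "local":
        # The table notes this may be split into two separate s_waitcnt.
        return ["s_waitcnt lgkmcnt(0) & vmcnt(0)"]
    raise ValueError(
        f"combination not listed in the table: {sync_scope}, {address_space}")
```

The remaining instructions of each sequence are those of the corresponding ``load atomic acquire`` row, which this sketch does not reproduce.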
3798 The memory order also adds the single thread optimization constraints defined in
3799 table
3800 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3802   .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3803      :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3805      ============ ==============================================================
3806      LLVM Memory  Optimization Constraints
3807      Ordering
3808      ============ ==============================================================
3809      unordered    *none*
3810      monotonic    *none*
3811      acquire      - If a load atomic/atomicrmw then no following load/load
3812                     atomic/store/store atomic/atomicrmw/fence instruction can
3813                     be moved before the acquire.
3814                   - If a fence then same as load atomic, plus no preceding
3815                     associated fence-paired-atomic can be moved after the fence.
3816      release      - If a store atomic/atomicrmw then no preceding load/load
3817                     atomic/store/store atomic/atomicrmw/fence instruction can
3818                     be moved after the release.
3819                   - If a fence then same as store atomic, plus no following
3820                     associated fence-paired-atomic can be moved before the
3821                     fence.
3822      acq_rel      Same constraints as both acquire and release.
3823      seq_cst      - If a load atomic then same constraints as acquire, plus no
3824                     preceding sequentially consistent load atomic/store
3825                     atomic/atomicrmw/fence instruction can be moved after the
3826                     seq_cst.
3827                   - If a store atomic then the same constraints as release, plus
3828                     no following sequentially consistent load atomic/store
3829                     atomic/atomicrmw/fence instruction can be moved before the
3830                     seq_cst.
3831                   - If an atomicrmw/fence then same constraints as acq_rel.
3832      ============ ==============================================================
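The "moved before" half of these constraints can be sketched as a small predicate. This is a simplified model, not the backend's implementation: instructions are represented as hypothetical ``(kind, ordering)`` tuples, the fence-paired-atomic rules are ignored, and only the question "may the following instruction be moved before this atomic?" is answered.

```python
# Sketch only: may a following memory instruction be moved before an atomic,
# according to the table above? Tuples are (kind, ordering); ordering is None
# for non-atomic loads and stores. All names here are illustrative.

ATOMIC_KINDS = {"load atomic", "store atomic", "atomicrmw", "fence"}

def may_move_before(atomic, following):
    kind, ordering = atomic
    following_kind, following_ordering = following
    # acquire, acq_rel and seq_cst on a load atomic/atomicrmw/fence all
    # include the acquire constraint: nothing that follows may move before.
    if (kind in ("load atomic", "atomicrmw", "fence")
            and ordering in ("acquire", "acq_rel", "seq_cst")):
        return False
    # A seq_cst store atomic additionally pins following seq_cst atomics.
    if kind == "store atomic" and ordering == "seq_cst":
        return not (following_kind in ATOMIC_KINDS
                    and following_ordering == "seq_cst")
    # unordered, monotonic and release place no constraint in this direction.
    return True
```

For example, ``may_move_before(("load atomic", "acquire"), ("store", None))`` is ``False``, while the same query against a ``monotonic`` load is ``True``.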
3834 Trap Handler ABI
3835 ~~~~~~~~~~~~~~~~
3837 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible runtimes
3838 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3839 the ``s_trap`` instruction with the following usage:
3841   .. table:: AMDGPU Trap Handler for AMDHSA OS
3842      :name: amdgpu-trap-handler-for-amdhsa-os-table
3844      =================== =============== =============== =======================
3845      Usage               Code Sequence   Trap Handler    Description
3846                                          Inputs
3847      =================== =============== =============== =======================
3848      reserved            ``s_trap 0x00``                 Reserved by hardware.
3849      ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
3850                                            ``queue_ptr`` ``debugtrap``
3851                                          ``VGPR0``:      intrinsic (not
3852                                            ``arg``       implemented).
3853      ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
3854                                            ``queue_ptr`` terminated and its
3855                                                          associated queue put
3856                                                          into the error state.
3857      ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
3858                                                            installed then
3859                                                            behaves as a
3860                                                            no-operation. The
3861                                                            trap handler is
3862                                                            entered and
3863                                                            immediately returns
3864                                                            to continue
3865                                                            execution of the
3866                                                            wavefront.
3867                                                          - If the debugger is
3868                                                            installed, causes
3869                                                            the debug trap to be
3870                                                            reported by the
3871                                                            debugger and the
3872                                                            wavefront is put in
3873                                                            the halt state until
3874                                                            resumed by the
3875                                                            debugger.
3876      reserved            ``s_trap 0x04``                 Reserved.
3877      reserved            ``s_trap 0x05``                 Reserved.
3878      reserved            ``s_trap 0x06``                 Reserved.
3879      debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
3880                                                          breakpoints.
3881      reserved            ``s_trap 0x08``                 Reserved.
3882      reserved            ``s_trap 0xfe``                 Reserved.
3883      reserved            ``s_trap 0xff``                 Reserved.
3884      =================== =============== =============== =======================
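The table can be read as a lookup from trap ID to usage. The sketch below encodes it that way; the names are hypothetical, and treating every ID that the table does not name explicitly as reserved is an assumption of this example, as is the 8-bit range check.

```python
# Sketch only: trap IDs from the AMDHSA trap handler table above.
# AMDHSA_TRAP_USAGE and describe_trap are hypothetical names.

AMDHSA_TRAP_USAGE = {
    0x00: "reserved by hardware",
    0x01: "debugtrap(arg)",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
    0x07: "debugger breakpoint",
}

def describe_trap(trap_id):
    """Return the usage for an s_trap ID (assumed to be an 8-bit value)."""
    if not 0 <= trap_id <= 0xFF:
        raise ValueError(f"trap ID out of range: {trap_id:#x}")
    # Assumption: IDs the table does not name are treated as reserved.
    return AMDHSA_TRAP_USAGE.get(trap_id, "reserved")
```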
3886 AMDPAL
3887 ------
3889 This section provides code conventions used when the target triple OS is
3890 ``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3891 from the application/runtime to each invocation of a hardware shader. These
3892 parameters include both generic, application-controlled parameters called
3893 *user data* as well as system-generated parameters that are a product of the
3894 draw or dispatch execution.
3896 User Data
3897 ~~~~~~~~~
3899 Each hardware stage has a set of 32-bit *user data registers* which can be
3900 written from a command buffer and then loaded into SGPRs when waves are launched
3901 via a subsequent dispatch or draw operation. This is the way most arguments are
3902 passed from the application/runtime to a hardware shader.
3904 Compute User Data
3905 ~~~~~~~~~~~~~~~~~
3907 Compute shader user data mappings are simpler than those of graphics shaders and
3908 are fixed.
3910 Note that there are always 10 available *user data entries* in registers -
3911 entries beyond that limit must be fetched from memory (via the spill table
3912 pointer) by the shader.
3914   .. table:: PAL Compute Shader User Data Registers
3915      :name: pal-compute-user-data-registers
3917      ============= ================================
3918      User Register Description
3919      ============= ================================
3920      0             Global Internal Table (32-bit pointer)
3921      1             Per-Shader Internal Table (32-bit pointer)
3922      2 - 11        Application-Controlled User Data (10 32-bit values)
3923      12            Spill Table (32-bit pointer)
3924      13 - 14       Thread Group Count (64-bit pointer)
3925      15            GDS Range
3926      ============= ================================
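To make the fixed compute mapping concrete, the sketch below locates an application-controlled *user data entry*: the first 10 entries live in user registers 2 - 11, and anything beyond that is reached through the spill table pointer in user register 12. The function and constant names are hypothetical, and the layout inside the spill table is not modeled because it is not described here.

```python
# Sketch only: where a compute shader finds application user data entry N,
# following the table above. Names are illustrative.

FIRST_APP_REGISTER = 2      # user register holding entry 0
NUM_REGISTER_ENTRIES = 10   # entries available in registers
SPILL_TABLE_REGISTER = 12   # user register holding the spill table pointer

def locate_compute_entry(entry):
    """Return ("register", n) or ("spill table", n) for a user data entry."""
    if entry < 0:
        raise ValueError("entry index must be non-negative")
    if entry < NUM_REGISTER_ENTRIES:
        return ("register", FIRST_APP_REGISTER + entry)
    # Beyond the register limit the shader must load the value from memory
    # via the pointer held in the spill table register.
    return ("spill table", SPILL_TABLE_REGISTER)
```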
3928 Graphics User Data
3929 ~~~~~~~~~~~~~~~~~~
3931 Graphics pipelines support a much more flexible user data mapping:
3933   .. table:: PAL Graphics Shader User Data Registers
3934      :name: pal-graphics-user-data-registers
3936      ============= ================================
3937      User Register Description
3938      ============= ================================
3939      0             Global Internal Table (32-bit pointer)
3940      +             Per-Shader Internal Table (32-bit pointer)
3941      + 1-15        Application Controlled User Data
3942                    (1-15 Contiguous 32-bit Values in Registers)
3943      +             Spill Table (32-bit pointer)
3944      +             Draw Index (First Stage Only)
3945      +             Vertex Offset (First Stage Only)
3946      +             Instance Offset (First Stage Only)
3947      ============= ================================
3949   The placement of the global internal table remains fixed in the first *user
3950   data SGPR register*. Otherwise all parameters are optional, and can be mapped
3951   to any desired *user data SGPR register*, with the following restrictions:
3953   * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3954     active hardware stage in a graphics pipeline (i.e. where the API vertex
3955     shader runs).
3957   * Application-controlled user data must be mapped into a contiguous range of
3958     user data registers.
3960   * The application-controlled user data range supports compaction remapping, so
3961     only *entries* that are actually consumed by the shader must be assigned to
3962     corresponding *registers*. Note that in order to support an efficient runtime
3963     implementation, the remapping must pack *registers* in the same order as
3964     *entries*, with unused *entries* removed.
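The compaction rule in the last restriction can be sketched as follows: consumed *entries* are packed into a contiguous run of *registers* in entry order, with unused entries dropped. The function name and the ``first_register`` parameter are hypothetical; where the run starts is chosen by whoever builds the mapping.

```python
# Sketch only: compaction remapping of application-controlled user data.
# compact_user_data and first_register are illustrative names.

def compact_user_data(consumed_entries, first_register):
    """Map each consumed entry to a register, preserving entry order."""
    mapping = {}
    for offset, entry in enumerate(sorted(set(consumed_entries))):
        # Registers are contiguous and follow the order of the entries.
        mapping[entry] = first_register + offset
    return mapping
```

For instance, a shader that consumes only entries 0, 2 and 5, with the run starting at register 2, would get entry 0 in register 2, entry 2 in register 3 and entry 5 in register 4.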
3966 .. _pal_global_internal_table:
3968 Global Internal Table
3969 ~~~~~~~~~~~~~~~~~~~~~
3971 The global internal table is a table of *shader resource descriptors* (SRDs) that
3972 define how certain engine-wide, runtime-managed resources should be accessed
3973 from a shader. The majority of these resources have HW-defined formats, and it
3974 is up to the compiler to write/read data as required by the target hardware.
3976 The following table illustrates the required format:
3978   .. table:: PAL Global Internal Table
3979      :name: pal-git-table
3981      ============= ================================
3982      Offset        Description
3983      ============= ================================
3984      0-3           Graphics Scratch SRD
3985      4-7           Compute Scratch SRD
3986      8-11          ES/GS Ring Output SRD
3987      12-15         ES/GS Ring Input SRD
3988      16-19         GS/VS Ring Output #0
3989      20-23         GS/VS Ring Output #1
3990      24-27         GS/VS Ring Output #2
3991      28-31         GS/VS Ring Output #3
3992      32-35         GS/VS Ring Input SRD
3993      36-39         Tessellation Factor Buffer SRD
3994      40-43         Off-Chip LDS Buffer SRD
3995      44-47         Off-Chip Param Cache Buffer SRD
3996      48-51         Sample Position Buffer SRD
3997      52            vaRange::ShadowDescriptorTable High Bits
3998      ============= ================================
4000   The pointer to the global internal table passed to the shader as user data
4001   is a 32-bit pointer. The top 32 bits should be assumed to be the same as
4002   the top 32 bits of the pipeline, so the shader may use the program
4003   counter's top 32 bits.
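The pointer extension described above amounts to combining the 32-bit user data value with the upper half of the program counter. A minimal sketch, using a hypothetical function name and plain integers in place of registers:

```python
# Sketch only: rebuild a 64-bit address for the global internal table from
# the 32-bit user data pointer and the program counter's top 32 bits.

def extend_table_pointer(table_ptr_lo32, program_counter):
    """Return the 64-bit table address."""
    high_bits = program_counter & 0xFFFFFFFF00000000
    return high_bits | (table_ptr_lo32 & 0xFFFFFFFF)
```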
4005 Unspecified OS
4006 --------------
4008 This section provides code conventions used when the target triple OS is
4009 empty (see :ref:`amdgpu-target-triples`).
4011 Trap Handler ABI
4012 ~~~~~~~~~~~~~~~~
4014 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime
4015 does not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
4016 intrinsics are handled as follows:
4018   .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
4019      :name: amdgpu-trap-handler-for-non-amdhsa-os-table
4021      =============== =============== ===========================================
4022      Usage           Code Sequence   Description
4023      =============== =============== ===========================================
4024      llvm.trap       s_endpgm        Causes wavefront to be terminated.
4025      llvm.debugtrap  *none*          Compiler warning given that there is no
4026                                      trap handler installed.
4027      =============== =============== ===========================================
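Taken together with the AMDHSA trap handler table, the choice of code sequence depends only on the intrinsic and the target OS. The sketch below models that choice; the function name is hypothetical, and ``None`` stands for "no code emitted, compiler warning given".

```python
# Sketch only: code sequence chosen for the trap intrinsics, per the AMDHSA
# and non-AMDHSA trap handler tables. lower_trap_intrinsic is illustrative.

def lower_trap_intrinsic(intrinsic, target_os):
    """Return the emitted instruction, or None when nothing is emitted."""
    if intrinsic not in ("llvm.trap", "llvm.debugtrap"):
        raise ValueError(f"not a trap intrinsic: {intrinsic}")
    if target_os == "amdhsa":
        # The runtime installs a trap handler, so s_trap can be used.
        return "s_trap 0x02" if intrinsic == "llvm.trap" else "s_trap 0x03"
    if intrinsic == "llvm.trap":
        # No trap handler: terminate the wavefront instead.
        return "s_endpgm"
    return None
```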
Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     ======== ==== ========= ===========================================

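
The layout implied by the table above can be sketched in Python. This is only
an illustration: it assumes the implicit arguments are packed directly after
the explicit arguments, each at its stated 8-byte alignment, and the argument
labels are hypothetical names chosen for this example.

.. code-block:: python

  # Hypothetical labels for positions 1-6 of the table.
  IMPLICIT_ARGS = [
      "global_offset_x", "global_offset_y", "global_offset_z",
      "printf_buffer", "vqueue_pointer", "aqlwrap_pointer",
  ]

  def implicit_arg_offsets(explicit_args_size: int) -> dict:
      """Byte offset of each implicit argument in the kernarg segment."""
      size, align = 8, 8                       # every row is 8 bytes, 8-aligned
      offset = (explicit_args_size + align - 1) // align * align
      offsets = {}
      for name in IMPLICIT_ARGS:
          offsets[name] = offset
          offset += size
      return offsets

  # A kernel with 12 bytes of explicit arguments (an assumed example size).
  offsets = implicit_arg_offsets(12)

With 12 bytes of explicit arguments, the first implicit argument is rounded up
to offset 16 and the remaining ones follow at 8-byte steps.
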
.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX9.

This section describes the general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

.. toctree::
   :hidden:

   AMDGPUAsmGFX7
   AMDGPUAsmGFX8
   AMDGPUAsmGFX9
   AMDGPUOperandSyntax

An instruction has the following syntax:

    *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*

Note that operands are normally comma-separated while modifiers are
space-separated.

The order of operands and modifiers is fixed. Most modifiers are optional and
may be omitted.

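
To make the syntax above concrete, here is a small Python sketch that splits
one line of assembly into opcode, operands and modifiers. It is not how the
LLVM-MC parser works; it is a simplified model that assumes the last
comma-separated chunk holds one operand followed by the modifiers, so operand
expressions containing spaces (such as ``vmcnt(0) & expcnt(0)``) are not
handled. The function name is hypothetical.

.. code-block:: python

  def split_instruction(line: str):
      """Split an assembly line into (opcode, operands, modifiers)."""
      text = line.split(";", 1)[0].strip()     # drop a trailing ';' comment
      parts = text.split(None, 1)
      if not parts:
          return "", [], []
      opcode = parts[0]
      rest = parts[1] if len(parts) > 1 else ""
      # Split on commas that are not nested inside [...] or (...).
      chunks, depth, current = [], 0, ""
      for ch in rest:
          if ch in "[(":
              depth += 1
          elif ch in "])":
              depth -= 1
          if ch == "," and depth == 0:
              chunks.append(current.strip())
              current = ""
          else:
              current += ch
      if current.strip():
          chunks.append(current.strip())
      if not chunks:
          return opcode, [], []
      # The last chunk holds the final operand followed by any modifiers.
      *head, tail = chunks
      tail_tokens = tail.split()
      return opcode, head + tail_tokens[:1], tail_tokens[1:]

  parsed = split_instruction("ds_add_u32 v2, v4 offset:16")
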
See the detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.

Note that features under development are not included in this description.

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.

Operands
~~~~~~~~

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

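
The SGPR, VGPR and TTMP forms listed above can be modelled with a short Python
sketch. It is an illustration rather than the assembler's actual parser: it
covers only single registers, ranges and index expressions built from ``+``,
``-`` and ``*``, and it ignores special registers, register lists and ``off``.
The helper names are hypothetical.

.. code-block:: python

  import ast
  import re

  def _eval_index(expr: str) -> int:
      """Evaluate an integer index expression using only +, - and *."""
      def walk(node):
          if isinstance(node, ast.Constant) and type(node.value) is int:
              return node.value
          if isinstance(node, ast.BinOp):
              left, right = walk(node.left), walk(node.right)
              if isinstance(node.op, ast.Add):
                  return left + right
              if isinstance(node.op, ast.Sub):
                  return left - right
              if isinstance(node.op, ast.Mult):
                  return left * right
          raise ValueError(f"unsupported index expression: {expr!r}")
      return walk(ast.parse(expr.strip(), mode="eval").body)

  def parse_register(operand: str):
      """Return (register file, first index, register count)."""
      m = re.fullmatch(r"(s|v|ttmp)(\d+)", operand)
      if m:
          return m.group(1), int(m.group(2)), 1
      m = re.fullmatch(r"(s|v|ttmp)\[([^\]:]+)(?::([^\]]+))?\]", operand)
      if not m:
          raise ValueError(f"not a register operand: {operand!r}")
      first = _eval_index(m.group(2))
      last = _eval_index(m.group(3)) if m.group(3) else first
      if last < first:
          raise ValueError(f"descending register range: {operand!r}")
      return m.group(1), first, last - first + 1

  quad = parse_register("ttmp[4:7]")
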
Modifiers
~~~~~~~~~

A detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in
the ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in
the ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For the full list of supported instructions, refer to "Scalar Memory
Operations" in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in
the ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in
the ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For the full list of supported instructions, refer to "SOPC Instructions" in
the ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For the full list of supported instructions, refer to "SOPP Instructions" in
the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on the operands.
To force a specific encoding, one can add a suffix to the opcode of the
instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

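
The suffix convention above can be expressed as a simple lookup. The Python
sketch below is illustrative only: the function name is hypothetical, and it
merely recognizes the four suffixes listed above without checking that the
base opcode is a real VALU instruction.

.. code-block:: python

  FORCED_ENCODINGS = {
      "_e32": "32-bit VOP1/VOP2/VOPC",
      "_e64": "64-bit VOP3",
      "_dpp": "VOP_DPP",
      "_sdwa": "VOP_SDWA",
  }

  def forced_encoding(opcode: str):
      """Return (base opcode, forced encoding), or (opcode, None) if no suffix."""
      for suffix, encoding in FORCED_ENCODINGS.items():
          if opcode.endswith(suffix):
              return opcode[: -len(suffix)], encoding
      return opcode, None        # the assembler is free to pick the encoding

  base, encoding = forced_encoding("v_mul_i32_i24_e64")
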
VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions"
in the ISA Manual.

.. TODO
   Remove once we switch to code object v3 by default.

HSA Code Object Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, one can specify this data with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings.  *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
any amd_kernel_code_t values that are unspecified a default value will be
used.  The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.

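
The defaulting rules above can be summarized in a few lines of Python. This is
a sketch for illustration, not the assembler's implementation: the helper names
are hypothetical, and the *machine_version_\** keys are left out because they
depend on the -mcpu option.

.. code-block:: python

  # Keys whose default is not 0, as listed above.
  NONZERO_DEFAULTS = {
      "kernel_code_version_major": 1,
      "machine_kind": 1,
      "kernel_code_entry_byte_offset": 256,
      "wavefront_size": 6,
      "kernarg_segment_alignment": 4,
      "group_segment_alignment": 4,
      "private_segment_alignment": 4,
  }

  def effective_value(key: str, specified: dict) -> int:
      """Value of a key once defaults are applied to the given key/value pairs."""
      return specified.get(key, NONZERO_DEFAULTS.get(key, 0))

  def decode_alignment(encoded: int) -> int:
      """Alignments are stored as a power of two: n means 2**n."""
      return 1 << encoded

  # Only one key is given explicitly; everything else falls back to a default.
  specified = {"is_ptr64": 1}
  kernarg_alignment = decode_alignment(
      effective_value("kernarg_segment_alignment", specified))

Here the default encoded alignment of 4 decodes to an alignment of 16.
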
The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         kernarg_segment_byte_size = 8
         wavefront_sgpr_count = 2
         workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world

Predefined Symbols (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

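
The update rule shared by ``.amdgcn.next_free_vgpr`` and
``.amdgcn.next_free_sgpr`` can be sketched in Python as follows. The function
name is hypothetical, and the register numbers passed in are assumed to have
been extracted from each instruction already.

.. code-block:: python

  def update_next_free(symbol_value: int, referenced_indices) -> int:
      """Apply the per-instruction update rule to a next_free_* symbol.

      referenced_indices holds the register numbers of one register file
      (VGPR or SGPR) explicitly referenced by a single instruction.
      """
      if not referenced_indices:
          return symbol_value              # nothing referenced: no change
      highest = max(referenced_indices)
      if symbol_value <= highest:
          return highest + 1               # one past the highest register used
      return symbol_value

  # VGPRs referenced by three consecutive instructions (assumed example).
  next_free_vgpr = 0                       # value before assembly begins
  for vgprs in ([0], [1], [1, 2, 0]):
      next_free_vgpr = update_next_free(next_free_vgpr, vgprs)

Because the symbol only ever grows under this rule, resetting it to zero at the
start of each kernel is what gives a per-kernel count.
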
Code Object Directives (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.amdgcn_target <target>
+++++++++++++++++++++++

Optional directive which declares the target supported by the containing
assembler source file. Valid values are described in
:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
to validate command-line options such as ``-triple``, ``-mcpu``, and those
which specify target features.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

4478   .. table:: AMDHSA Kernel Assembler Directives
4479      :name: amdhsa-kernel-directives-table
4481      ======================================================== ================ ============ ===================
4482      Directive                                                Default          Supported On Description
4483      ======================================================== ================ ============ ===================
4484      ``.amdhsa_group_segment_fixed_size``                     0                GFX6-GFX9    Controls GROUP_SEGMENT_FIXED_SIZE in
4485                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4486      ``.amdhsa_private_segment_fixed_size``                   0                GFX6-GFX9    Controls PRIVATE_SEGMENT_FIXED_SIZE in
4487                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4488      ``.amdhsa_user_sgpr_private_segment_buffer``             0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
4489                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4490      ``.amdhsa_user_sgpr_dispatch_ptr``                       0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_PTR in
4491                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4492      ``.amdhsa_user_sgpr_queue_ptr``                          0                GFX6-GFX9    Controls ENABLE_SGPR_QUEUE_PTR in
4493                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4494      ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                GFX6-GFX9    Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
4495                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4496      ``.amdhsa_user_sgpr_dispatch_id``                        0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_ID in
4497                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4498      ``.amdhsa_user_sgpr_flat_scratch_init``                  0                GFX6-GFX9    Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
4499                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4500      ``.amdhsa_user_sgpr_private_segment_size``               0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
4501                                                                                             :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4502      ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
4503                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4504      ``.amdhsa_system_sgpr_workgroup_id_x``                   1                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_X in
4505                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4506      ``.amdhsa_system_sgpr_workgroup_id_y``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Y in
4507                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4508      ``.amdhsa_system_sgpr_workgroup_id_z``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Z in
4509                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4510      ``.amdhsa_system_sgpr_workgroup_info``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_INFO in
4511                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4512      ``.amdhsa_system_vgpr_workitem_id``                      0                GFX6-GFX9    Controls ENABLE_VGPR_WORKITEM_ID in
4513                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4514                                                                                             Possible values are defined in
4515                                                                                             :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
4516      ``.amdhsa_next_free_vgpr``                               Required         GFX6-GFX9    Maximum VGPR number explicitly referenced, plus one.
4517                                                                                             Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
4518                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4519      ``.amdhsa_next_free_sgpr``                               Required         GFX6-GFX9    Maximum SGPR number explicitly referenced, plus one.
4520                                                                                             Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4521                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4522      ``.amdhsa_reserve_vcc``                                  1                GFX6-GFX9    Whether the kernel may use the special VCC SGPR.
4523                                                                                             Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4524                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4525      ``.amdhsa_reserve_flat_scratch``                         1                GFX7-GFX9    Whether the kernel may use flat instructions to access
4526                                                                                             scratch memory. Used to calculate
4527                                                                                             GRANULATED_WAVEFRONT_SGPR_COUNT in
4528                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4529      ``.amdhsa_reserve_xnack_mask``                           Target           GFX8-GFX9    Whether the kernel may trigger XNACK replay.
4530                                                               Feature                       Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4531                                                               Specific                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4532                                                               (+xnack)
4533      ``.amdhsa_float_round_mode_32``                          0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_32 in
4534                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4535                                                                                             Possible values are defined in
4536                                                                                             :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4537      ``.amdhsa_float_round_mode_16_64``                       0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_16_64 in
4538                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4539                                                                                             Possible values are defined in
4540                                                                                             :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4541      ``.amdhsa_float_denorm_mode_32``                         0                GFX6-GFX9    Controls FLOAT_DENORM_MODE_32 in
4542                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4543                                                                                             Possible values are defined in
4544                                                                                             :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4545      ``.amdhsa_float_denorm_mode_16_64``                      3                GFX6-GFX9    Controls FLOAT_DENORM_MODE_16_64 in
4546                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4547                                                                                             Possible values are defined in
4548                                                                                             :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4549      ``.amdhsa_dx10_clamp``                                   1                GFX6-GFX9    Controls ENABLE_DX10_CLAMP in
4550                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4551      ``.amdhsa_ieee_mode``                                    1                GFX6-GFX9    Controls ENABLE_IEEE_MODE in
4552                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4553      ``.amdhsa_fp16_overflow``                                0                GFX9         Controls FP16_OVFL in
4554                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4555      ``.amdhsa_exception_fp_ieee_invalid_op``                 0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
4556                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4557      ``.amdhsa_exception_fp_denorm_src``                      0                GFX6-GFX9    Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
4558                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4559      ``.amdhsa_exception_fp_ieee_div_zero``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
4560                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4561      ``.amdhsa_exception_fp_ieee_overflow``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
4562                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4563      ``.amdhsa_exception_fp_ieee_underflow``                  0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
4564                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4565      ``.amdhsa_exception_fp_ieee_inexact``                    0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
4566                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4567      ``.amdhsa_exception_int_div_zero``                       0                GFX6-GFX9    Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
4568                                                                                             :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4569      ======================================================== ================ ============ ===================
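
The rules for the directive list (any order, no repeats, defaults applied, and
"Required" directives mandatory) can be modelled with the Python sketch below.
It is an illustration only: the function name is hypothetical, the dictionary
holds just a handful of rows copied from the table above, and value-range and
target checks are omitted.

.. code-block:: python

  REQUIRED = object()   # marker for directives the table lists as "Required"

  # A small subset of the table above, mapping directive name to its default.
  DIRECTIVE_DEFAULTS = {
      ".amdhsa_user_sgpr_kernarg_segment_ptr": 0,
      ".amdhsa_system_sgpr_workgroup_id_x": 1,
      ".amdhsa_next_free_vgpr": REQUIRED,
      ".amdhsa_next_free_sgpr": REQUIRED,
      ".amdhsa_float_denorm_mode_16_64": 3,
      ".amdhsa_ieee_mode": 1,
  }

  def resolve_directives(pairs):
      """Resolve (directive, value) pairs, given in any order, to final values."""
      seen = {}
      for name, value in pairs:
          if name not in DIRECTIVE_DEFAULTS:
              raise ValueError(f"unknown directive {name}")
          if name in seen:
              raise ValueError(f"{name} cannot be repeated")
          seen[name] = value
      resolved = {}
      for name, default in DIRECTIVE_DEFAULTS.items():
          if name in seen:
              resolved[name] = seen[name]
          elif default is REQUIRED:
              raise ValueError(f"{name} is required")
          else:
              resolved[name] = default
      return resolved

  resolved = resolve_directives([
      (".amdhsa_next_free_sgpr", 2),
      (".amdhsa_user_sgpr_kernarg_segment_ptr", 1),
      (".amdhsa_next_free_vgpr", 3),
  ])
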
Example HSA Source Code (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code-block:: none

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1], 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size   hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__