1 =============================
2 User Guide for AMDGPU Backend
3 =============================
14 AMDGPU/AMDGPUAsmGFX900
15 AMDGPU/AMDGPUAsmGFX904
16 AMDGPU/AMDGPUAsmGFX906
17 AMDGPU/AMDGPUAsmGFX908
18 AMDGPU/AMDGPUAsmGFX90a
19 AMDGPU/AMDGPUAsmGFX940
21 AMDGPU/AMDGPUAsmGFX1011
22 AMDGPU/AMDGPUAsmGFX1013
23 AMDGPU/AMDGPUAsmGFX1030
27 AMDGPUInstructionSyntax
28 AMDGPUInstructionNotation
29 AMDGPUDwarfExtensionsForHeterogeneousDebugging
30 AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
The AMDGPU backend provides ISA code generation for AMD GPUs, from the
R600 family through the current GCN families. It lives in the
37 ``llvm/lib/Target/AMDGPU`` directory.
42 .. _amdgpu-target-triples:
47 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
48 to specify the target triple:
50 .. table:: AMDGPU Architectures
51 :name: amdgpu-architecture-table
53 ============ ==============================================================
54 Architecture Description
55 ============ ==============================================================
56 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
57 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
58 ============ ==============================================================
60 .. table:: AMDGPU Vendors
61 :name: amdgpu-vendor-table
63 ============ ==============================================================
65 ============ ==============================================================
66 ``amd`` Can be used for all AMD GPU usage.
67 ``mesa3d`` Can be used if the OS is ``mesa3d``.
68 ============ ==============================================================
70 .. table:: AMDGPU Operating Systems
73 ============== ============================================================
75 ============== ============================================================
76 *<empty>* Defaults to the *unknown* OS.
77 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
80 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
81 loader on Linux. See *AMD ROCm Platform Release Notes*
82 [AMD-ROCm-Release-Notes]_ for supported hardware and
84 - AMD's PAL runtime using the *pal-amdhsa* loader on
87 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
88 runtime using the *pal-amdpal* loader on Windows and Linux
90 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
91 3D runtime using the *mesa-mesa3d* loader on Linux.
92 ============== ============================================================
94 .. table:: AMDGPU Environments
95 :name: amdgpu-environment-table
97 ============ ==============================================================
98 Environment Description
99 ============ ==============================================================
101 ============ ==============================================================
103 .. _amdgpu-processors:
108 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
109 specify the AMDGPU processor together with optional target features. See
110 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
111 specific information.
113 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
* ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
118 .. table:: AMDGPU Processors
119 :name: amdgpu-processor-table
121 =========== =============== ============ ===== ================= =============== =============== ======================
122 Processor Alternative Target dGPU/ Target Target OS Support Example
123 Processor Triple APU Features Properties *(see* Products
124 Architecture Supported `amdgpu-os`_
133 =========== =============== ============ ===== ================= =============== =============== ======================
134 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
135 -----------------------------------------------------------------------------------------------------------------------
136 ``r600`` ``r600`` dGPU - Does not
141 ``r630`` ``r600`` dGPU - Does not
146 ``rs880`` ``r600`` dGPU - Does not
151 ``rv670`` ``r600`` dGPU - Does not
156 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
157 -----------------------------------------------------------------------------------------------------------------------
158 ``rv710`` ``r600`` dGPU - Does not
163 ``rv730`` ``r600`` dGPU - Does not
168 ``rv770`` ``r600`` dGPU - Does not
173 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
174 -----------------------------------------------------------------------------------------------------------------------
175 ``cedar`` ``r600`` dGPU - Does not
180 ``cypress`` ``r600`` dGPU - Does not
185 ``juniper`` ``r600`` dGPU - Does not
190 ``redwood`` ``r600`` dGPU - Does not
195 ``sumo`` ``r600`` dGPU - Does not
200 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
201 -----------------------------------------------------------------------------------------------------------------------
202 ``barts`` ``r600`` dGPU - Does not
207 ``caicos`` ``r600`` dGPU - Does not
212 ``cayman`` ``r600`` dGPU - Does not
217 ``turks`` ``r600`` dGPU - Does not
222 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
223 -----------------------------------------------------------------------------------------------------------------------
224 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
229 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
234 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
239 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
240 -----------------------------------------------------------------------------------------------------------------------
241 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
242 flat - *pal-amdhsa* - A6 Pro-7050B
243 scratch - *pal-amdpal* - A8-7100
251 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
252 flat - *pal-amdhsa* - FirePro W9100
253 scratch - *pal-amdpal* - FirePro S9150
255 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
256 flat - *pal-amdhsa* - Radeon R9 290x
257 scratch - *pal-amdpal* - Radeon R390
259 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
260 - ``mullins`` flat - *pal-amdpal* - E1-2200
268 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
269 flat - *pal-amdpal* - Radeon HD 8770
272 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
279 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
280 -----------------------------------------------------------------------------------------------------------------------
281 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
282 flat - *pal-amdhsa* - Pro A6-8500B
283 scratch - *pal-amdpal* - A8-8600P
299 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
300 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
301 scratch - *pal-amdpal* - Radeon R9 385
302 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
303 - *pal-amdhsa* - Radeon R9 Fury
304 - *pal-amdpal* - Radeon R9 FuryX
307 - Radeon Instinct MI8
308 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
309 flat - *pal-amdhsa* - Radeon RX 480
310 scratch - *pal-amdpal* - Radeon Instinct MI6
311 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
313 scratch - *pal-amdpal*
314 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
315 flat - *pal-amdhsa* - FirePro S7100
316 scratch - *pal-amdpal* - FirePro W7100
319 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
321 scratch - *pal-amdpal* .. TODO::
326 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_ [AMD-GCN-GFX940-GFX942-CDNA3]_
327 -----------------------------------------------------------------------------------------------------------------------
328 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
329 flat - *pal-amdhsa* Frontier Edition
330 scratch - *pal-amdpal* - Radeon RX Vega 56
334 - Radeon Instinct MI25
335 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
336 flat - *pal-amdhsa* - Ryzen 5 2400G
337 scratch - *pal-amdpal*
338 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
340 - *pal-amdpal* .. TODO::
345 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
346 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
347 scratch - *pal-amdpal* - Radeon VII
349 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
353 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
360 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - AMD Instinct MI210 Accelerator
361 - tgsplit flat - *rocm-amdhsa* - AMD Instinct MI250 Accelerator
362 - xnack scratch - *rocm-amdhsa* - AMD Instinct MI250X Accelerator
363 - kernarg preload - Packed
367 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
368 flat - Ryzen 7 4700GE
369 scratch - Ryzen 5 4600G
381 ``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
383 - xnack scratch .. TODO::
384 - kernarg preload - Packed
385 work-item Add product
388 ``gfx941`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
390 - xnack scratch .. TODO::
391 - kernarg preload - Packed
392 work-item Add product
395 ``gfx942`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
397 - xnack scratch .. TODO::
398 - kernarg preload - Packed
399 work-item Add product
402 **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
403 -----------------------------------------------------------------------------------------------------------------------
404 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
405 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
406 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
408 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
409 - wavefrontsize64 - Absolute - *pal-amdhsa*
410 - xnack flat - *pal-amdpal*
412 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
413 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
414 - xnack scratch - *pal-amdpal*
415 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
416 - wavefrontsize64 flat - *pal-amdhsa*
417 - xnack scratch - *pal-amdpal* .. TODO::
422 **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
423 -----------------------------------------------------------------------------------------------------------------------
424 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
425 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
426 scratch - *pal-amdpal* - Radeon RX 6900 XT
427 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
428 - wavefrontsize64 flat - *pal-amdhsa*
429 scratch - *pal-amdpal*
430 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
431 - wavefrontsize64 flat - *pal-amdhsa*
432 scratch - *pal-amdpal* .. TODO::
437 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
438 - wavefrontsize64 flat
443 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
444 - wavefrontsize64 flat
450 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
451 - wavefrontsize64 flat
456 ``gfx1036`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
457 - wavefrontsize64 flat
463 **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
464 -----------------------------------------------------------------------------------------------------------------------
465 ``gfx1100`` ``amdgcn`` dGPU - cumode - Architected - *pal-amdpal* *TBA*
466 - wavefrontsize64 flat
469 work-item Add product
472 ``gfx1101`` ``amdgcn`` dGPU - cumode - Architected *TBA*
473 - wavefrontsize64 flat
476 work-item Add product
479 ``gfx1102`` ``amdgcn`` dGPU - cumode - Architected *TBA*
480 - wavefrontsize64 flat
483 work-item Add product
486 ``gfx1103`` ``amdgcn`` APU - cumode - Architected *TBA*
487 - wavefrontsize64 flat
490 work-item Add product
493 ``gfx1150`` ``amdgcn`` APU - cumode - Architected *TBA*
494 - wavefrontsize64 flat
497 work-item Add product
500 ``gfx1151`` ``amdgcn`` APU - cumode - Architected *TBA*
501 - wavefrontsize64 flat
504 work-item Add product
507 ``gfx1200`` ``amdgcn`` dGPU - cumode - Architected *TBA*
508 - wavefrontsize64 flat
511 work-item Add product
514 ``gfx1201`` ``amdgcn`` dGPU - cumode - Architected *TBA*
515 - wavefrontsize64 flat
518 work-item Add product
521 =========== =============== ============ ===== ================= =============== =============== ======================
A generic processor allows a single code object to execute on any of the processors it
supports. Such code objects may not perform as well as those built for a specific processor.
526 Generic processors are only available on code object V6 and above (see :ref:`amdgpu-elf-code-object`).
528 Generic processor code objects are versioned. See :ref:`amdgpu-generic-processor-versioning` for more information on how versioning works.
530 .. table:: AMDGPU Generic Processors
531 :name: amdgpu-generic-processor-table
533 ==================== ============== ================= ================== ================= =================================
534 Processor Target Supported Target Features Target Properties Target Restrictions
535 Triple Processors Supported
538 ==================== ============== ================= ================== ================= =================================
539 ``gfx9-generic`` ``amdgcn`` - ``gfx900`` - xnack - Absolute flat - ``v_mad_mix`` instructions
540 - ``gfx902`` scratch are not available on
541 - ``gfx904`` ``gfx900``, ``gfx902``,
542 - ``gfx906`` ``gfx909``, ``gfx90c``
543 - ``gfx909`` - ``v_fma_mix`` instructions
544 - ``gfx90c`` are not available on ``gfx904``
545 - sramecc is not available on
547 - The following instructions
548 are not available on ``gfx906``:
561 ``gfx10-1-generic`` ``amdgcn`` - ``gfx1010`` - xnack - Absolute flat - The following instructions are
562 - ``gfx1011`` - wavefrontsize64 scratch not available on ``gfx1011``
563 - ``gfx1012`` - cumode and ``gfx1012``
569 - ``v_dot2c_f32_f16``
575 - BVH Ray Tracing instructions
580 ``gfx10-3-generic`` ``amdgcn`` - ``gfx1030`` - wavefrontsize64 - Absolute flat No restrictions.
581 - ``gfx1031`` - cumode scratch
589 ``gfx11-generic`` ``amdgcn`` - ``gfx1100`` - wavefrontsize64 - Architected Various codegen pessimizations
590 - ``gfx1101`` - cumode flat scratch are applied to work around some
591 - ``gfx1102`` - Packed hazards specific to some targets
592 - ``gfx1103`` work-item within this family.
594 - ``gfx1151`` Not all VGPRs can be used on:
600 SALU floating point instructions
601 and single-use VGPR hint
602 instructions are not available
608 SGPRs are not supported for src1
609 in dpp instructions for:
613 ==================== ============== ================= ================== ================= =================================
615 .. _amdgpu-generic-processor-versioning:
617 Generic Processor Versioning
618 ----------------------------
620 Generic processor (see :ref:`amdgpu-generic-processor-table`) code objects are versioned (see :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`) between 1 and 255.
621 The version of non-generic code objects is always set to 0.
623 For a generic code object, adding a new supported processor may require the code generated for the generic target to be changed
624 so it can continue to execute on the previously supported processors as well as on the new one.
625 When this happens, the generic code object version number is incremented at the same time as the generic target is updated.
627 Each supported processor of a generic target is mapped to the version it was introduced in.
628 A generic code object can execute on a supported processor if the version of the code object being loaded is
629 greater than or equal to the version in which the processor was added to the generic target.
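For example, if a processor is added to a generic target in version 2 of that
target, a version 1 code object cannot be loaded on the newly added processor,
while a version 2 code object can be loaded on every processor supported since
version 1.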
631 .. _amdgpu-target-features:
636 Target features control how code is generated to support certain
637 processor specific features. Not all target features are supported by
638 all processors. The runtime must ensure that the features supported by
639 the device used to execute the code match the features enabled when
640 generating the code. A mismatch of features may result in incorrect
641 execution, or a reduction in performance.
The target features supported by each processor are listed in
644 :ref:`amdgpu-processors`.
Target features are controlled by exactly one of the following Clang options:
649 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
The ``-mcpu`` and ``--offload-arch`` options can specify target features as
652 optional components of the target ID. If omitted, the target feature has the
653 ``any`` value. See :ref:`amdgpu-target-id`.
655 ``-m[no-]<target-feature>``
657 Target features not specified by the target ID are specified using a
658 separate option. These target features can have an ``on`` or ``off``
659 value. ``on`` is specified by omitting the ``no-`` prefix, and
660 ``off`` is specified by including the ``no-`` prefix. The default
661 if not specified is ``off``.
665 ``-mcpu=gfx908:xnack+``
666 Enable the ``xnack`` feature.
667 ``-mcpu=gfx908:xnack-``
668 Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.
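
At the LLVM IR level, these options are reflected in the ``"target-cpu"`` and
``"target-features"`` function attributes. The following is only an
illustrative sketch of ``-mcpu=gfx908:xnack+``; it does not list every feature
Clang would emit:

.. code-block:: llvm

  ; Kernel compiled for gfx908 with the xnack target feature enabled.
  define amdgpu_kernel void @feature_example() #0 {
    ret void
  }

  attributes #0 = { "target-cpu"="gfx908" "target-features"="+xnack" }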
674 .. table:: AMDGPU Target Features
675 :name: amdgpu-target-features-table
677 =============== ============================ ==================================================
678 Target Feature Clang Option to Control Description
680 =============== ============================ ==================================================
681 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
682 when generating code for kernels. When disabled
683 native WGP wavefront execution mode is used,
684 when enabled CU wavefront execution mode is used
685 (see :ref:`amdgpu-amdhsa-memory-model`).
687 sramecc - ``-mcpu`` If specified, generate code that can only be
688 - ``--offload-arch`` loaded and executed in a process that has a
689 matching setting for SRAMECC.
691 If not specified for code object V2 to V3, generate
692 code that can be loaded and executed in a process
693 with SRAMECC enabled.
695 If not specified for code object V4 or above, generate
696 code that can be loaded and executed in a process
697 with either setting of SRAMECC.
699 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
700 work-groups are launched in threadgroup split mode.
701 When enabled the waves of a work-group may be
702 launched in different CUs.
704 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
705 generating code for kernels. When disabled
706 native wavefront size 32 is used, when enabled
707 wavefront size 64 is used.
709 xnack - ``-mcpu`` If specified, generate code that can only be
710 - ``--offload-arch`` loaded and executed in a process that has a
711 matching setting for XNACK replay.
713 If not specified for code object V2 to V3, generate
714 code that can be loaded and executed in a process
715 with XNACK replay enabled.
717 If not specified for code object V4 or above, generate
718 code that can be loaded and executed in a process
719 with either setting of XNACK replay.
721 XNACK replay can be used for demand paging and
722 page migration. If enabled in the device, then if
723 a page fault occurs the code may execute
724 incorrectly unless generated with XNACK replay
725 enabled, or generated for code object V4 or above without
726 specifying XNACK replay. Executing code that was
727 generated with XNACK replay enabled, or generated
728 for code object V4 or above without specifying XNACK replay,
729 on a device that does not have XNACK replay
730 enabled will execute correctly but may be less
731 performant than code generated for XNACK replay
733 =============== ============================ ==================================================
735 .. _amdgpu-target-id:
740 AMDGPU supports target IDs. See `Clang Offload Bundler
741 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
742 description. The AMDGPU target specific information is:
745 Is an AMDGPU processor or alternative processor name specified in
746 :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
747 the primary processor and alternative processor names. The canonical form
target ID only allows the primary processor name.
751 Is a target feature name specified in :ref:`amdgpu-target-features-table` that
752 is supported by the processor. The target features supported by each processor
are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
754 a target ID are marked as being controlled by ``-mcpu`` and
755 ``--offload-arch``. Each target feature must appear at most once in a target
756 ID. The non-canonical form target ID allows the target features to be
757 specified in any order. The canonical form target ID requires the target
758 features to be specified in alphabetic order.
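For example, ``gfx90a:sramecc+:xnack-`` is a canonical form target ID, while
``gfx90a:xnack-:sramecc+`` names the same target in non-canonical form because
the target features are not in alphabetic order.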
760 .. _amdgpu-target-id-v2-v3:
762 Code Object V2 to V3 Target ID
763 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
765 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
766 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
767 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF syntax:
773 <target-id> ::== <processor> ( "+" <target-feature> )*
775 Where a target feature is omitted if *Off* and present if *On* or *Any*.
The code object V2 to V3 cannot represent *Any* and treats it the same as *On*.
782 .. _amdgpu-embedding-bundled-objects:
784 Embedding Bundled Code Objects
785 ------------------------------
787 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
788 as described in `Clang Offload Bundler
789 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
793 The target ID syntax used for code object V2 to V3 for a bundle entry ID
794 differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
796 .. _amdgpu-address-spaces:
801 The AMDGPU architecture supports a number of memory address spaces. The address
802 space names use the OpenCL standard names, with some additions.
804 The AMDGPU address spaces correspond to target architecture specific LLVM
805 address space numbers used in LLVM IR.
807 The AMDGPU address spaces are described in
808 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
809 supported for the ``amdgcn`` target.
811 .. table:: AMDGPU Address Spaces
812 :name: amdgpu-address-spaces-table
814 ===================================== =============== =========== ================ ======= ============================
815 .. 64-Bit Process Address Space
816 ------------------------------------- --------------- ----------- ---------------- ------------------------------------
817 Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value
818 Space Number Name Name Size
819 ===================================== =============== =========== ================ ======= ============================
820 Generic 0 flat flat 64 0x0000000000000000
821 Global 1 global global 64 0x0000000000000000
822 Region 2 N/A GDS 32 *not implemented for AMDHSA*
823 Local 3 group LDS 32 0xFFFFFFFF
824 Constant 4 constant *same as global* 64 0x0000000000000000
825 Private 5 private scratch 32 0xFFFFFFFF
826 Constant 32-bit 6 *TODO* 0x00000000
827 Buffer Fat Pointer 7 N/A N/A 160 0
828 Buffer Resource 8 N/A V# 128 0x00000000000000000000000000000000
829 Buffer Strided Pointer (experimental) 9 *TODO*
830 Streamout Registers 128 N/A GS_REGS
831 ===================================== =============== =========== ================ ======= ============================
**Generic**
The generic address space is supported unless the *Target Properties* column
of :ref:`amdgpu-processor-table` specifies *Does not support generic address space*.
The generic address space uses the hardware flat address support for two fixed
ranges of virtual addresses (the private and local apertures) that are
outside the range of addressable global memory, to map from a flat address to
a private or local address. This uses FLAT instructions that can take a flat
address and access global, private (scratch), and group (LDS) memory depending
on whether the address is within one of the aperture ranges.
845 Flat access to scratch requires hardware aperture setup and setup in the
846 kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
847 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
848 setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
850 To convert between a private or group address space address (termed a segment
851 address) and a flat address the base address of the corresponding aperture
852 can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
855 GFX9-GFX11 the aperture base addresses are directly available as inline
856 constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
aligned to 2^32, which makes it easier to convert from flat to segment or
from segment to flat.
861 A global address space address has the same value when used as a flat address
862 so no conversion is needed.
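
The following is an illustrative sketch (names are placeholders) showing how
these address spaces appear in LLVM IR, casting a private (scratch) allocation
and an LDS variable to the generic address space before accessing them through
flat operations:

.. code-block:: llvm

  @lds_var = internal addrspace(3) global i32 undef, align 4

  define amdgpu_kernel void @generic_example() {
    ; Private (scratch) allocation in address space 5.
    %priv = alloca i32, align 4, addrspace(5)
    ; Cast segment addresses to the generic (flat) address space 0.
    %priv.flat = addrspacecast ptr addrspace(5) %priv to ptr
    %lds.flat = addrspacecast ptr addrspace(3) @lds_var to ptr
    ; Generic accesses select scratch or LDS via the aperture check.
    store i32 1, ptr %priv.flat, align 4
    store i32 2, ptr %lds.flat, align 4
    ret void
  }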
864 **Global and Constant**
865 The global and constant address spaces both use global virtual addresses,
866 which are the same virtual address space used by the CPU. However, some
867 virtual addresses may only be accessible to the CPU, some only accessible
868 by the GPU, and some by both.
870 Using the constant address space indicates that the data will not change
871 during the execution of the kernel. This allows scalar read instructions to
be used. Because the constant address space can only be modified on the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only device global memory is visible to
and managed by the host. The vector and scalar L1 caches are invalidated
876 of volatile data before each kernel dispatch execution to allow constant
877 memory to change values between kernel dispatches.
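
As an illustrative sketch (the kernel signature is a placeholder), a load
through a constant address space pointer may be selected as a scalar load:

.. code-block:: llvm

  ; %in points to memory that does not change during the kernel dispatch.
  define amdgpu_kernel void @constant_example(ptr addrspace(4) %in, ptr addrspace(1) %out) {
    %v = load i32, ptr addrspace(4) %in, align 4
    store i32 %v, ptr addrspace(1) %out, align 4
    ret void
  }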
**Region**
The region address space uses the hardware Global Data Store (GDS). All
881 wavefronts executing on the same device will access the same memory for any
882 given region address. However, the same region address accessed by wavefronts
883 executing on different devices will access different memory. It is higher
884 performance than global memory. It is allocated by the runtime. The data
885 store (DS) instructions can be used to access it.
**Local**
The local address space uses the hardware Local Data Store (LDS) which is
889 automatically allocated when the hardware creates the wavefronts of a
890 work-group, and freed when all the wavefronts of a work-group have
891 terminated. All wavefronts belonging to the same work-group will access the
892 same memory for any given local address. However, the same local address
893 accessed by wavefronts belonging to different work-groups will access
894 different memory. It is higher performance than global memory. The data store
895 (DS) instructions can be used to access it.
**Private**
The private address space uses the hardware scratch memory support, which
automatically allocates memory when it creates a wavefront and frees it when
the wavefront terminates. The memory accessed by a lane of a wavefront for any
901 given private address will be different to the memory accessed by another lane
902 of the same or different wavefront for the same private address.
904 If a kernel dispatch uses scratch, then the hardware allocates memory from a
905 pool of backing memory allocated by the runtime for each wavefront. The lanes
906 of the wavefront access this using dword (4 byte) interleaving. The mapping
907 used from private address to backing memory address is:
909 ``wavefront-scratch-base +
910 ((private-address / 4) * wavefront-size * 4) +
911 (wavefront-lane-id * 4) + (private-address % 4)``
913 If each lane of a wavefront accesses the same private address, the
914 interleaving results in adjacent dwords being accessed and hence requires
915 fewer cache lines to be fetched.
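For example, with a wavefront size of 64, lane 5 accessing private address 8
maps to ``wavefront-scratch-base + (8 / 4) * 64 * 4 + (5 * 4) + (8 % 4)``,
that is, ``wavefront-scratch-base + 532``.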
917 There are different ways that the wavefront scratch base address is
918 determined by a wavefront (see
919 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
921 Scratch memory can be accessed in an interleaved manner using buffer
922 instructions with the scratch buffer descriptor and per wavefront scratch
923 offset, by the scratch instructions, or by flat instructions. Multi-dword
access is not supported except by flat and scratch instructions in GFX9-GFX11.
927 Code that manipulates the stack values in other lanes of a wavefront,
928 such as by ``addrspacecast``-ing stack pointers to generic ones and taking offsets
929 that reach other lanes or by explicitly constructing the scratch buffer descriptor,
930 triggers undefined behavior when it modifies the scratch values of other lanes.
931 The compiler may assume that such modifications do not occur.
932 When using code object V5 ``LIBOMPTARGET_STACK_SIZE`` may be used to provide the
933 private segment size in bytes, for cases where a dynamic stack is used.
938 **Buffer Fat Pointer**
939 The buffer fat pointer is an experimental address space that is currently
940 unsupported in the backend. It exposes a non-integral pointer that is in
941 the future intended to support the modelling of 128-bit buffer descriptors
942 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
943 *pointer*), allowing normal LLVM load/store/atomic operations to be used to
model the buffer descriptors used heavily in graphics workloads targeting the backend.
947 The buffer descriptor used to construct a buffer fat pointer must be *raw*:
948 the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
949 must be off, and the extent must be measured in bytes. (On subtargets where
bounds checking may be disabled, buffer fat pointers may choose to enable it or not.)
**Buffer Resource**
The buffer resource pointer, in address space 8, is the newer form
955 for representing buffer descriptors in AMDGPU IR, replacing their
956 previous representation as `<4 x i32>`. It is a non-integral pointer
957 that represents a 128-bit buffer descriptor resource (`V#`).
959 Since, in general, a buffer resource supports complex addressing modes that cannot
960 be easily represented in LLVM (such as implicit swizzled access to structured
961 buffers), it is **illegal** to perform non-trivial address computations, such as
962 ``getelementptr`` operations, on buffer resources. They may be passed to
963 AMDGPU buffer intrinsics, and they may be converted to and from ``i128``.
Casting a buffer resource to a buffer fat pointer is permitted and adds an offset of 0.
968 Buffer resources can be created from 64-bit pointers (which should be either
969 generic or global) using the `llvm.amdgcn.make.buffer.rsrc` intrinsic, which
970 takes the pointer, which becomes the base of the resource,
the 16-bit stride (and swizzle control) field stored in bits `63:48` of a `V#`,
972 the 32-bit NumRecords/extent field (bits `95:64`), and the 32-bit flags field
973 (bits `127:96`). The specific interpretation of these fields varies by the
974 target architecture and is detailed in the ISA descriptions.
976 **Buffer Strided Pointer**
977 The buffer index pointer is an experimental address space. It represents
978 a 128-bit buffer descriptor and a 32-bit offset, like the **Buffer Fat
979 Pointer**. Additionally, it contains an index into the buffer, which
980 allows the direct addressing of structured elements. These components appear
981 in that order, i.e., the descriptor comes first, then the 32-bit offset
982 followed by the 32-bit index.
984 The bits in the buffer descriptor must meet the following requirements:
985 the stride is the size of a structured element, the "add tid" flag must be 0,
986 and the swizzle enable bits must be off.
988 **Streamout Registers**
989 Dedicated registers used by the GS NGG Streamout Instructions. The register
990 file is modelled as a memory in a distinct address space because it is indexed
991 by an address-like offset in place of named registers, and because register
992 accesses affect LGKMcnt. This is an internal address space used only by the
993 compiler. Do not use this address space for IR pointers.
995 .. _amdgpu-memory-scopes:
1000 This section provides LLVM memory synchronization scopes supported by the AMDGPU
1001 backend memory model when the target triple OS is ``amdhsa`` (see
1002 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
1004 The memory model supported is based on the HSA memory model [HSA]_ which is
1005 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
1006 relation is transitive over the synchronizes-with relation independent of scope
1007 and synchronizes-with allows the memory scope instances to be inclusive (see
1008 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
1010 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
1011 inclusion and requires the memory scopes to exactly match. However, this
1012 is conservatively correct for OpenCL.
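
The scopes in the table below appear in LLVM IR as the ``syncscope`` argument
of atomic operations and fences. The following is only an illustrative sketch:

.. code-block:: llvm

  define amdgpu_kernel void @scope_example(ptr addrspace(1) %p) {
    ; Atomic add visible to all threads on the same agent (GPU device).
    %old = atomicrmw add ptr addrspace(1) %p, i32 1 syncscope("agent") seq_cst
    ; Release fence limited to the threads of the same work-group.
    fence syncscope("workgroup") release
    ret void
  }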
1014 .. table:: AMDHSA LLVM Sync Scopes
1015 :name: amdgpu-amdhsa-llvm-sync-scopes-table
1017 ======================= ===================================================
1018 LLVM Sync Scope Description
1019 ======================= ===================================================
1020 *none* The default: ``system``.
``system``              Synchronizes with, and participates in modification
1023 and seq_cst total orderings with, other operations
1024 (except image operations) for all address spaces
1025 (except private, or generic that accesses private)
1026 provided the other operation's sync scope is:
- ``system``.
- ``agent`` and executed by a thread on the same agent.
- ``workgroup`` and executed by a thread in the same work-group.
- ``wavefront`` and executed by a thread in the same wavefront.
1036 ``agent`` Synchronizes with, and participates in modification
1037 and seq_cst total orderings with, other operations
1038 (except image operations) for all address spaces
1039 (except private, or generic that accesses private)
1040 provided the other operation's sync scope is:
- ``system`` or ``agent`` and executed by a thread on the same agent.
- ``workgroup`` and executed by a thread in the same work-group.
- ``wavefront`` and executed by a thread in the same wavefront.
1049 ``workgroup`` Synchronizes with, and participates in modification
1050 and seq_cst total orderings with, other operations
1051 (except image operations) for all address spaces
1052 (except private, or generic that accesses private)
1053 provided the other operation's sync scope is:
1055 - ``system``, ``agent`` or ``workgroup`` and
1056 executed by a thread in the same work-group.
- ``wavefront`` and executed by a thread in the same wavefront.
1060 ``wavefront`` Synchronizes with, and participates in modification
1061 and seq_cst total orderings with, other operations
1062 (except image operations) for all address spaces
1063 (except private, or generic that accesses private)
1064 provided the other operation's sync scope is:
- ``system``, ``agent``, ``workgroup`` or ``wavefront`` and executed by a
  thread in the same wavefront.
1070 ``singlethread`` Only synchronizes with and participates in
1071 modification and seq_cst total orderings with,
1072 other operations (except image operations) running
1073 in the same thread for all address spaces (for
1074 example, in signal handlers).
1076 ``one-as`` Same as ``system`` but only synchronizes with other
1077 operations within the same address space.
1079 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
1080 operations within the same address space.
1082 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
1083 other operations within the same address space.
1085 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
1086 other operations within the same address space.
1088 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
1089 other operations within the same address space.
1090 ======================= ===================================================
1095 The AMDGPU backend implements the following LLVM IR intrinsics.
1097 *This section is WIP.*
1099 .. table:: AMDGPU LLVM IR Intrinsics
1100 :name: amdgpu-llvm-ir-intrinsics-table
1102 ============================================== ==========================================================
1103 LLVM Intrinsic Description
1104 ============================================== ==========================================================
1105 llvm.amdgcn.sqrt Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16
1106 (on targets with half support). Performs sqrt function.
1108 llvm.amdgcn.log Provides direct access to v_log_f32 and v_log_f16
1109 (on targets with half support). Performs log2 function.
1111 llvm.amdgcn.exp2 Provides direct access to v_exp_f32 and v_exp_f16
1112 (on targets with half support). Performs exp2 function.
1114 :ref:`llvm.frexp <int_frexp>` Implemented for half, float and double.
1116 :ref:`llvm.log2 <int_log2>` Implemented for float and half (and vectors of float or
1117 half). Not implemented for double. Hardware provides
1118 1ULP accuracy for float, and 0.51ULP for half. Float
instruction does not natively support denormal inputs.
1122 :ref:`llvm.sqrt <int_sqrt>` Implemented for double, float and half (and vectors).
1124 :ref:`llvm.log <int_log>` Implemented for float and half (and vectors).
1126 :ref:`llvm.exp <int_exp>` Implemented for float and half (and vectors).
1128 :ref:`llvm.log10 <int_log10>` Implemented for float and half (and vectors).
1130 :ref:`llvm.exp2 <int_exp2>` Implemented for float and half (and vectors of float or
1131 half). Not implemented for double. Hardware provides
1132 1ULP accuracy for float, and 0.51ULP for half. Float
instruction does not natively support denormal inputs.
1136 :ref:`llvm.stacksave.p5 <int_stacksave>` Implemented, must use the alloca address space.
1137 :ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space.
1139 :ref:`llvm.get.fpmode.i32 <int_get_fpmode>` The natural floating-point mode type is i32. This
is implemented by extracting relevant bits out of the MODE
1141 register with s_getreg_b32. The first 10 bits are the
1142 core floating-point mode. Bits 12:18 are the exception
1143 mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
1144 relevant to floating-point instructions are 0s.
1146 :ref:`llvm.get.rounding<int_get_rounding>` AMDGPU supports two separately controllable rounding
1147 modes depending on the floating-point type. One
1148 controls float, and the other controls both double and
1149 half operations. If both modes are the same, returns
1150 one of the standard return values. If the modes are
1151 different, returns one of :ref:`12 extended values
1152 <amdgpu-rounding-mode-enumeration-values-table>`
1153 describing the two modes.
1155 To nearest, ties away from zero is not a supported
1156 mode. The raw rounding mode values in the MODE
1157 register do not exactly match the FLT_ROUNDS values,
1158 so a conversion is performed.
1160 :ref:`llvm.set.rounding<int_set_rounding>` Input value expected to be one of the valid results
1161 from '``llvm.get.rounding``'. Rounding mode is
1162 undefined if not passed a valid input. This should be
1163 a wave uniform value. In case of a divergent input
1164 value, the first active lane's value will be used.
1166 :ref:`llvm.get.fpenv<int_get_fpenv>` Returns the current value of the AMDGPU floating point environment.
1167 This stores information related to the current rounding mode,
1168 denormalization mode, enabled traps, and floating point exceptions.
1169 The format is a 64-bit concatenation of the MODE and TRAPSTS registers.
:ref:`llvm.set.fpenv<int_set_fpenv>` Sets the floating point environment to the specified state.
1173 llvm.amdgcn.wave.reduce.umin Performs an arithmetic unsigned min reduction on the unsigned values
1174 provided by each lane in the wavefront.
1175 Intrinsic takes a hint for reduction strategy using second operand
1176 0: Target default preference,
1: `Iterative strategy`, and
2: `DPP`.
If the target does not support DPP operations (e.g. gfx6/7),
the reduction will be performed using the default iterative strategy.
1181 Intrinsic is currently only implemented for i32.
1183 llvm.amdgcn.wave.reduce.umax Performs an arithmetic unsigned max reduction on the unsigned values
1184 provided by each lane in the wavefront.
1185 Intrinsic takes a hint for reduction strategy using second operand
1186 0: Target default preference,
1: `Iterative strategy`, and
2: `DPP`.
If the target does not support DPP operations (e.g. gfx6/7),
the reduction will be performed using the default iterative strategy.
1191 Intrinsic is currently only implemented for i32.
1193 llvm.amdgcn.udot2 Provides direct access to v_dot2_u32_u16 across targets which
1194 support such instructions. This performs unsigned dot product
1195 with two v2i16 operands, summed with the third i32 operand. The
1196 i1 fourth operand is used to clamp the output.
1198 llvm.amdgcn.udot4 Provides direct access to v_dot4_u32_u8 across targets which
1199 support such instructions. This performs unsigned dot product
1200 with two i32 operands (holding a vector of 4 8bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp the output.
1204 llvm.amdgcn.udot8 Provides direct access to v_dot8_u32_u4 across targets which
1205 support such instructions. This performs unsigned dot product
1206 with two i32 operands (holding a vector of 8 4bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp the output.
1210 llvm.amdgcn.sdot2 Provides direct access to v_dot2_i32_i16 across targets which
1211 support such instructions. This performs signed dot product
1212 with two v2i16 operands, summed with the third i32 operand. The
1213 i1 fourth operand is used to clamp the output.
1214 When applicable (e.g. no clamping), this is lowered into
1215 v_dot2c_i32_i16 for targets which support it.
1217 llvm.amdgcn.sdot4 Provides direct access to v_dot4_i32_i8 across targets which
1218 support such instructions. This performs signed dot product
1219 with two i32 operands (holding a vector of 4 8bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp the output.
1222 When applicable (i.e. no clamping / operand modifiers), this is lowered
1223 into v_dot4c_i32_i8 for targets which support it.
1224 RDNA3 does not offer v_dot4_i32_i8, and rather offers
1225 v_dot4_i32_iu8 which has operands to hold the signedness of the
1226 vector operands. Thus, this intrinsic lowers to the signed version
1227 of this instruction for gfx11 targets.
llvm.amdgcn.sdot8 Provides direct access to v_dot8_i32_i4 across targets which
1230 support such instructions. This performs signed dot product
1231 with two i32 operands (holding a vector of 8 4bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp the output.
1234 When applicable (i.e. no clamping / operand modifiers), this is lowered
1235 into v_dot8c_i32_i4 for targets which support it.
1236 RDNA3 does not offer v_dot8_i32_i4, and rather offers
v_dot8_i32_iu4 which has operands to hold the signedness of the
1238 vector operands. Thus, this intrinsic lowers to the signed version
1239 of this instruction for gfx11 targets.
1241 llvm.amdgcn.sudot4 Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs
1242 dot product with two i32 operands (holding a vector of 4 8bit values), summed
1243 with the fifth i32 operand. The i1 sixth operand is used to clamp
1244 the output. The i1s preceding the vector operands decide the signedness.
1246 llvm.amdgcn.sudot8 Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs
1247 dot product with two i32 operands (holding a vector of 8 4bit values), summed
1248 with the fifth i32 operand. The i1 sixth operand is used to clamp
1249 the output. The i1s preceding the vector operands decide the signedness.
1251 llvm.amdgcn.sched_barrier Controls the types of instructions that may be allowed to cross the intrinsic
1252 during instruction scheduling. The parameter is a mask for the instruction types
1253 that can cross the intrinsic.
1255 - 0x0000: No instructions may be scheduled across sched_barrier.
- 0x0001: All non-memory, non-side-effect-producing instructions may be
  scheduled across sched_barrier, *i.e.* allow ALU instructions to pass.
1258 - 0x0002: VALU instructions may be scheduled across sched_barrier.
1259 - 0x0004: SALU instructions may be scheduled across sched_barrier.
1260 - 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
1261 - 0x0010: All VMEM instructions may be scheduled across sched_barrier.
1262 - 0x0020: VMEM read instructions may be scheduled across sched_barrier.
1263 - 0x0040: VMEM write instructions may be scheduled across sched_barrier.
1264 - 0x0080: All DS instructions may be scheduled across sched_barrier.
- 0x0100: All DS read instructions may be scheduled across sched_barrier.
1266 - 0x0200: All DS write instructions may be scheduled across sched_barrier.
1267 - 0x0400: All Transcendental (e.g. V_EXP) instructions may be scheduled across sched_barrier.
1269 llvm.amdgcn.sched_group_barrier Creates schedule groups with specific properties to create custom scheduling
1270 pipelines. The ordering between groups is enforced by the instruction scheduler.
The intrinsic applies to the code that precedes the intrinsic. The intrinsic
1272 takes three values that control the behavior of the schedule groups.
1274 - Mask : Classify instruction groups using the llvm.amdgcn.sched_barrier mask values.
1275 - Size : The number of instructions that are in the group.
1276 - SyncID : Order is enforced between groups with matching values.
1278 The mask can include multiple instruction types. It is undefined behavior to set
1279 values beyond the range of valid masks.
1281 Combining multiple sched_group_barrier intrinsics enables an ordering of specific
1282 instruction types during instruction scheduling. For example, the following enforces
1283 a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA
1286 | ``// 1 VMEM read``
1287 | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
| ``// 1 VALU``
| ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
| ``// 5 MFMA``
| ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
1293 llvm.amdgcn.iglp_opt An **experimental** intrinsic for instruction group level parallelism. The intrinsic
implements predefined instruction scheduling orderings. The intrinsic applies to the
1295 surrounding scheduling region. The intrinsic takes a value that specifies the
1296 strategy. The compiler implements two strategies.
1298 0. Interleave DS and MFMA instructions for small GEMM kernels.
1299 1. Interleave DS and MFMA instructions for single wave small GEMM kernels.
1301 Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic
1302 cannot be combined with sched_barrier or sched_group_barrier.
1304 The iglp_opt strategy implementations are subject to change.
1306 llvm.amdgcn.atomic.cond.sub.u32 Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32
1307 and ds_cond_sub_u32 based on address space on gfx12 targets. This
1308 performs subtraction only if the memory value is greater than or
1309 equal to the data value.
1311 llvm.amdgcn.s.getpc Provides access to the s_getpc_b64 instruction, but with the return value
1312 sign-extended from the width of the underlying PC hardware register even on
1313 processors where the s_getpc_b64 instruction returns a zero-extended value.
1315 ============================================== ==========================================================
.. TODO::

   List AMDGPU intrinsics.
1321 .. _amdgpu_metadata:
The AMDGPU backend implements the following target custom LLVM IR metadata.
1329 .. _amdgpu_last_use:
1331 '``amdgpu.last.use``' Metadata
1332 ------------------------------
1334 Sets TH_LOAD_LU temporal hint on load instructions that support it.
Takes priority over nontemporal hint (TH_LOAD_NT). This takes no arguments.
1338 .. code-block:: llvm
1340 %val = load i32, ptr %in, align 4, !amdgpu.last.use !{}
1346 The AMDGPU backend supports the following LLVM IR attributes.
1348 .. table:: AMDGPU LLVM IR Attributes
1349 :name: amdgpu-llvm-ir-attributes-table
1351 ======================================= ==========================================================
1352 LLVM Attribute Description
1353 ======================================= ==========================================================
1354 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
1355 will be specified when the kernel is dispatched. Generated
1356 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
1357 The IR implied default value is 1,1024. Clang may emit this attribute
1358 with more restrictive bounds depending on language defaults.
1359 If the actual block or workgroup size exceeds the limit at any point during
1360 the execution, the behavior is undefined. For example, even if there is
1361 only one active thread but the thread local id exceeds the limit, the
1362 behavior is undefined.
1364 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
1365 argument block size for the implicit arguments. This
1366 varies by OS and language (for OpenCL see
1367 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
1368 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
1369 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
1370 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
1371 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
1372 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
1373 execution unit. Generated by the ``amdgpu_waves_per_eu``
1374 CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
1375 and the backend may not be able to satisfy the request. If
1376 the specified range is incompatible with the function's
1377 "amdgpu-flat-work-group-size" value, the implied occupancy
1378 bounds by the workgroup size takes precedence.
1380 "amdgpu-ieee" true/false. GFX6-GFX11 Only
1381 Specify whether the function expects the IEEE field of the
1382 mode register to be set on entry. Overrides the default for
1383 the calling convention.
1384 "amdgpu-dx10-clamp" true/false. GFX6-GFX11 Only
1385 Specify whether the function expects the DX10_CLAMP field of
1386 the mode register to be set on entry. Overrides the default
1387 for the calling convention.
1389 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
1390 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
1391 attribute, or reached through a call site marked with this attribute,
1392 the value returned by the intrinsic is undefined. The backend can
1393 generally infer this during code generation, so typically there is no
1394 benefit to frontends marking functions with this.
1396 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
1397 llvm.amdgcn.workitem.id.y intrinsic.
1399 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
1400 llvm.amdgcn.workitem.id.z intrinsic.
1402 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
1403 llvm.amdgcn.workgroup.id.x intrinsic.
1405 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
1406 llvm.amdgcn.workgroup.id.y intrinsic.
1408 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
1409 llvm.amdgcn.workgroup.id.z intrinsic.
1411 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
1412 llvm.amdgcn.dispatch.ptr intrinsic.
1414 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
1415 llvm.amdgcn.implicitarg.ptr intrinsic.
1417 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
1418 llvm.amdgcn.dispatch.id intrinsic.
1420 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
1421 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
1422 attributes, the queue pointer may be required in situations where the
1423 intrinsic call does not directly appear in the program. Some subtargets
require the queue pointer to handle some addrspacecasts, as well
1425 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
llvm.debugtrap intrinsics.
1428 "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1429 kernel argument that holds the pointer to the hostcall buffer. If this
1430 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1432 "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1433 kernel argument that holds the pointer to an initialized memory buffer
1434 that conforms to the requirements of the malloc/free device library V1
1435 version implementation. If this attribute is absent, then the
1436 amdgpu-no-implicitarg-ptr is also removed.
1438 "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1439 kernel argument that holds the multigrid synchronization pointer. If this
1440 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1442 "amdgpu-no-default-queue" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1443 kernel argument that holds the default queue pointer. If this
1444 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1446 "amdgpu-no-completion-action" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1447 kernel argument that holds the completion action pointer. If this
1448 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1450 "amdgpu-lds-size"="min[,max]" Min is the minimum number of bytes that will be allocated in the Local
1451 Data Store at address zero. Variables are allocated within this frame
1452 using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS
1453 pass. Optional max is the maximum number of bytes that will be allocated.
1454 Note that min==max indicates that no further variables can be added to
the frame. This is an internal detail of how LDS variables are lowered;
language front ends should not set this attribute.
1458 "amdgpu-gds-size" Bytes expected to be allocated at the start of GDS memory at entry.
1460 "amdgpu-git-ptr-high" The hard-wired high half of the address of the global information table
1461 for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since
1462 current hardware only allows a 16 bit value.
1464 "amdgpu-32bit-address-high-bits" Assumed high 32-bits for 32-bit address spaces which are really truncated
1465 64-bit addresses (i.e., addrspace(6))
1467 "amdgpu-color-export" Indicates shader exports color information if set to 1.
1468 Defaults to 1 for :ref:`amdgpu_ps <amdgpu-cc>`, and 0 for other calling
1469 conventions. Determines the necessity and type of null exports when a shader
1470 terminates early by killing lanes.
1472 "amdgpu-depth-export" Indicates shader exports depth information if set to 1. Determines the
1473 necessity and type of null exports when a shader terminates early by killing
1474 lanes. A depth-only shader will export to depth channel when no null export
1475 target is available (GFX11+).
1477 "InitialPSInputAddr" Set the initial value of the `spi_ps_input_addr` register for
1478 :ref:`amdgpu_ps <amdgpu-cc>` shaders. Any bits enabled by this value will
1479 be enabled in the final register value.
1481 "amdgpu-wave-priority-threshold" VALU instruction count threshold for adjusting wave priority. If exceeded,
1482 temporarily raise the wave priority at the start of the shader function
1483 until its last VMEM instructions to allow younger waves to issue their VMEM
1484 instructions as well.
1486 "amdgpu-memory-bound" Set internally by backend
1488 "amdgpu-wave-limiter" Set internally by backend
1490 "amdgpu-unroll-threshold" Set base cost threshold preference for loop unrolling within this function,
1491 default is 300. Actual threshold may be varied by per-loop metadata or
1492 reduced by heuristics.
1494 "amdgpu-max-num-workgroups"="x,y,z" Specify the maximum number of work groups for the kernel dispatch in the
1495 X, Y, and Z dimensions. Generated by the ``amdgpu_max_num_work_groups``
CLANG attribute [CLANG-ATTR]_. Clang only emits this attribute when all
three values are >= 1.
1499 "amdgpu-no-agpr" Indicates the function will not require allocating AGPRs. This is only
1500 relevant on subtargets with AGPRs. The behavior is undefined if a
1501 function which requires AGPRs is reached through any function marked
1502 with this attribute.
1504 ======================================= ==========================================================
1509 The AMDGPU backend supports the following calling conventions:
1511 .. table:: AMDGPU Calling Conventions
1514 =============================== ==========================================================
1515 Calling Convention Description
1516 =============================== ==========================================================
1517 ``ccc`` The C calling convention. Used by default.
1518 See :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`
1521 ``fastcc`` The fast calling convention. Mostly the same as the ``ccc``.
1523 ``coldcc`` The cold calling convention. Mostly the same as the ``ccc``.
1525 ``amdgpu_cs`` Used for Mesa/AMDPAL compute shaders.
1529 ``amdgpu_cs_chain`` Similar to ``amdgpu_cs``, with differences described below.
1531 Functions with this calling convention cannot be called directly. They must
1532 instead be launched via the ``llvm.amdgcn.cs.chain`` intrinsic.
1534 Arguments are passed in SGPRs, starting at s0, if they have the ``inreg``
1535 attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs
1536 than available in the subtarget is not allowed. On subtargets that use
1537 a scratch buffer descriptor (as opposed to ``scratch_{load,store}_*`` instructions),
1538 the scratch buffer descriptor is passed in s[48:51]. This limits the
1539 SGPR / ``inreg`` arguments to the equivalent of 48 dwords; using more
1540 than that is not allowed.
1542 The return type must be void.
1543 Varargs, sret, byval, byref, inalloca, preallocated are not supported.
1545 Values in scalar registers as well as v0-v7 are not preserved. Values in
1546 VGPRs starting at v8 are not preserved for the active lanes, but must be
1547 saved by the callee for inactive lanes when using WWM.
1549 Wave scratch is "empty" at function boundaries. There is no stack pointer input
1550 or output value, but functions are free to use scratch starting from an initial
1551 stack pointer. Calls to ``amdgpu_gfx`` functions are allowed and behave like they
1552 do in ``amdgpu_cs`` functions.
1554 All counters (``lgkmcnt``, ``vmcnt``, ``storecnt``, etc.) are presumed in an
1555 unknown state at function entry.
1557 A function may have multiple exits (e.g. one chain exit and one plain ``ret void``
1558 for when the wave ends), but all ``llvm.amdgcn.cs.chain`` exits must be in
1559 uniform control flow.
1561 ``amdgpu_cs_chain_preserve`` Same as ``amdgpu_cs_chain``, but active lanes for VGPRs starting at v8 are preserved.
1562 Calls to ``amdgpu_gfx`` functions are not allowed, and any calls to ``llvm.amdgcn.cs.chain``
1563 must not pass more VGPR arguments than the caller's VGPR function parameters.
1565 ``amdgpu_es`` Used for AMDPAL shader stage before geometry shader if geometry is in
1566 use. So either the domain (= tessellation evaluation) shader if
1567 tessellation is in use, or otherwise the vertex shader.
1571 ``amdgpu_gfx`` Used for AMD graphics targets. Functions with this calling convention
1572 cannot be used as entry points.
1576 ``amdgpu_gs`` Used for Mesa/AMDPAL geometry shaders.
1580 ``amdgpu_hs`` Used for Mesa/AMDPAL hull shaders (= tessellation control shaders).
1584 ``amdgpu_kernel`` See :ref:`amdgpu-amdhsa-function-call-convention-kernel-functions`
1586 ``amdgpu_ls`` Used for AMDPAL vertex shader if tessellation is in use.
1590 ``amdgpu_ps`` Used for Mesa/AMDPAL pixel shaders.
1594 ``amdgpu_vs`` Used for Mesa/AMDPAL last shader stage before rasterization (vertex
1595 shader if tessellation and geometry are not in use, or otherwise
1596 copy shader if one is needed).
1600 =============================== ==========================================================
As part of the AMDGPU MC layer, AMDGPU provides the following target specific
``MCExpr`` types.
1608 .. table:: AMDGPU MCExpr types:
1609 :name: amdgpu-mcexpr-table
1611 =================== ================= ========================================================
1612 MCExpr Operands Return value
1613 =================== ================= ========================================================
1614 ``max(arg, ...)`` 1 or more Variadic signed operation that returns the maximum
1615 value of all its arguments.
1617 ``or(arg, ...)`` 1 or more Variadic signed operation that returns the bitwise-or
1618 result of all its arguments.
1620 =================== ================= ========================================================
1622 .. _amdgpu-elf-code-object:
1627 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1628 can be linked by ``lld`` to produce a standard ELF shared code object which can
1629 be loaded and executed on an AMDGPU target.
1631 .. _amdgpu-elf-header:
1636 The AMDGPU backend uses the following ELF header:
1638 .. table:: AMDGPU ELF Header
1639 :name: amdgpu-elf-header-table
========================== ===============================
Field                      Value
========================== ===============================
1644 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
1645 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
1646 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
1647 - ``ELFOSABI_AMDGPU_HSA``
1648 - ``ELFOSABI_AMDGPU_PAL``
1649 - ``ELFOSABI_AMDGPU_MESA3D``
1650 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1651 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1652 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1653 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1654 - ``ELFABIVERSION_AMDGPU_HSA_V6``
1655 - ``ELFABIVERSION_AMDGPU_PAL``
1656 - ``ELFABIVERSION_AMDGPU_MESA3D``
``e_type`` - ``ET_REL``
- ``ET_DYN``
1659 ``e_machine`` ``EM_AMDGPU``
1661 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1662 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1663 :ref:`amdgpu-elf-header-e_flags-table-v4-v5`,
1664 and :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`
1665 ========================== ===============================
1669 .. table:: AMDGPU ELF Header Enumeration Values
1670 :name: amdgpu-elf-header-enumeration-values-table
=============================== =====
Name                            Value
=============================== =====
1677 ``ELFOSABI_AMDGPU_HSA`` 64
1678 ``ELFOSABI_AMDGPU_PAL`` 65
1679 ``ELFOSABI_AMDGPU_MESA3D`` 66
1680 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1681 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1682 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1683 ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1684 ``ELFABIVERSION_AMDGPU_HSA_V6`` 4
1685 ``ELFABIVERSION_AMDGPU_PAL`` 0
1686 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1687 =============================== =====
1689 ``e_ident[EI_CLASS]``
1692 * ``ELFCLASS32`` for ``r600`` architecture.
1694 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1695 process address space applications.
1697 ``e_ident[EI_DATA]``
1698 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1700 ``e_ident[EI_OSABI]``
1701 One of the following AMDGPU target architecture specific OS ABIs
1702 (see :ref:`amdgpu-os`):
1704 * ``ELFOSABI_NONE`` for *unknown* OS.
1706 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1708 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1712 ``e_ident[EI_ABIVERSION]``
The ABI version of the AMDGPU target architecture specific OS ABI to which the
code object conforms:
1716 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1717 runtime ABI for code object V2. Can no longer be emitted by this version of LLVM.
1719 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1720 runtime ABI for code object V3. Can no longer be emitted by this version of LLVM.
1722 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1723 runtime ABI for code object V4. Specify using the Clang option
1724 ``-mcode-object-version=4``.
1726 * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1727 runtime ABI for code object V5. Specify using the Clang option
1728 ``-mcode-object-version=5``. This is the default code object
1729 version if not specified.
1731 * ``ELFABIVERSION_AMDGPU_HSA_V6`` is used to specify the version of AMD HSA
1732 runtime ABI for code object V6. Specify using the Clang option
1733 ``-mcode-object-version=6``.
* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
runtime ABI.
* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
3D runtime ABI.
``e_type``
Can be one of the following values:
``ET_REL``
The type produced by the AMDGPU backend compiler as it is relocatable code
object.
``ET_DYN``
The type produced by the linker as it is a shared code object.
1752 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
``e_machine``
The value ``EM_AMDGPU`` is used for the machine for all processors supported
1756 by the ``r600`` and ``amdgcn`` architectures (see
1757 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1758 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1759 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1760 ``e_flags`` for code object V3 and above (see
1761 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1762 :ref:`amdgpu-elf-header-e_flags-table-v4-v5` and
1763 :ref:`amdgpu-elf-header-e_flags-table-v6-onwards`).
``e_entry``
The entry point is 0 as the entry points for individual kernels must be
1767 selected in order to invoke them through AQL packets.
``e_flags``
The AMDGPU backend uses the following ELF header flags:
1772 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1773 :name: amdgpu-elf-header-e_flags-v2-table
1775 ===================================== ===== =============================
1776 Name Value Description
1777 ===================================== ===== =============================
``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
target feature is enabled for all code
contained in the code object. If the
processor does not support the ``xnack``
target feature then must be 0. See
:ref:`amdgpu-target-features`.
``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
handler is enabled for all code
contained in the code object. If the
processor does not support a trap
handler then must be 0. See
:ref:`amdgpu-target-features`.
1797 ===================================== ===== =============================
1799 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1800 :name: amdgpu-elf-header-e_flags-table-v3
1802 ================================= ===== =============================
1803 Name Value Description
1804 ================================= ===== =============================
1805 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1807 ``EF_AMDGPU_MACH_xxx`` values
1809 :ref:`amdgpu-ef-amdgpu-mach-table`.
``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
target feature is enabled for all code
contained in the code object. If the
processor does not support the ``xnack``
target feature then must be 0. See
:ref:`amdgpu-target-features`.
``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
target feature is enabled for all code
contained in the code object. If the
processor does not support the ``sramecc``
target feature then must be 0. See
:ref:`amdgpu-target-features`.
1832 ================================= ===== =============================
1834 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and V5
1835 :name: amdgpu-elf-header-e_flags-table-v4-v5
1837 ============================================ ===== ===================================
1838 Name Value Description
1839 ============================================ ===== ===================================
1840 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1842 ``EF_AMDGPU_MACH_xxx`` values
1844 :ref:`amdgpu-ef-amdgpu-mach-table`.
1845 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1846 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1848 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1849 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1850 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1851 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1852 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1853 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1855 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1856 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1858 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1859 ============================================ ===== ===================================
1861 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V6 and After
1862 :name: amdgpu-elf-header-e_flags-table-v6-onwards
1864 ============================================ ========== =========================================
1865 Name Value Description
1866 ============================================ ========== =========================================
1867 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1869 ``EF_AMDGPU_MACH_xxx`` values
1871 :ref:`amdgpu-ef-amdgpu-mach-table`.
1872 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1873 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1875 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1876 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1877 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1878 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1879 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1880 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1882 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1883 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1885 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1886 ``EF_AMDGPU_GENERIC_VERSION_V`` 0xff000000 Generic code object version selection
1887 mask. This is a value between 1 and 255,
stored in the most significant byte of ``e_flags``.
1890 See :ref:`amdgpu-generic-processor-versioning`
1891 ============================================ ========== =========================================
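The feature and version fields above are extracted from ``e_flags`` with simple
mask operations. The following is a minimal C sketch using only the mask values
given in the preceding tables; the ``#define`` names mirror the table entries
and the helper is illustrative, not part of any LLVM or runtime API (LLVM
provides equivalent constants in ``llvm/BinaryFormat/ELF.h``). The resulting
``mach`` value can then be looked up in the ``EF_AMDGPU_MACH`` table below.

.. code:: c

  #include <stdint.h>
  #include <stdio.h>

  /* Mask values taken from the e_flags tables above (code object V4 onwards). */
  #define EF_AMDGPU_MACH               0x0ffu
  #define EF_AMDGPU_FEATURE_XNACK_V4   0x300u
  #define EF_AMDGPU_FEATURE_SRAMECC_V4 0xc00u
  #define EF_AMDGPU_GENERIC_VERSION_V  0xff000000u

  static void decode_amdgpu_e_flags(uint32_t e_flags) {
    unsigned mach    = e_flags & EF_AMDGPU_MACH;
    unsigned xnack   = e_flags & EF_AMDGPU_FEATURE_XNACK_V4;
    unsigned sramecc = e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4;
    /* For code object V6 onwards the generic code object version is stored
       in the most significant byte. */
    unsigned generic_version = (e_flags & EF_AMDGPU_GENERIC_VERSION_V) >> 24;

    printf("mach=0x%03x xnack=0x%03x sramecc=0x%03x generic-version=%u\n",
           mach, xnack, sramecc, generic_version);
  }
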
1893 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1894 :name: amdgpu-ef-amdgpu-mach-table
1896 ========================================== ========== =============================
1897 Name Value Description (see
1898 :ref:`amdgpu-processor-table`)
1899 ========================================== ========== =============================
1900 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1901 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1902 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1903 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1904 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1905 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1906 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1907 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1908 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1909 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1910 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1911 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1912 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1913 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1914 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1915 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1916 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1917 *reserved* 0x011 - Reserved for ``r600``
1918 0x01f architecture processors.
1919 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1920 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1921 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1922 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1923 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1924 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1925 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1926 *reserved* 0x027 Reserved.
1927 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1928 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1929 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1930 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1931 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1932 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1933 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1934 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1935 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1936 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1937 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1938 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1939 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1940 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1941 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1942 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1943 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1944 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1945 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1946 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1947 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1948 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1949 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1950 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1951 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940``
1952 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100``
1953 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1954 ``EF_AMDGPU_MACH_AMDGCN_GFX1150`` 0x043 ``gfx1150``
1955 ``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103``
1956 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036``
1957 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101``
1958 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102``
1959 ``EF_AMDGPU_MACH_AMDGCN_GFX1200`` 0x048 ``gfx1200``
1960 *reserved* 0x049 Reserved.
1961 ``EF_AMDGPU_MACH_AMDGCN_GFX1151`` 0x04a ``gfx1151``
1962 ``EF_AMDGPU_MACH_AMDGCN_GFX941`` 0x04b ``gfx941``
1963 ``EF_AMDGPU_MACH_AMDGCN_GFX942`` 0x04c ``gfx942``
1964 *reserved* 0x04d Reserved.
1965 ``EF_AMDGPU_MACH_AMDGCN_GFX1201`` 0x04e ``gfx1201``
1966 *reserved* 0x04f Reserved.
1967 *reserved* 0x050 Reserved.
1968 ``EF_AMDGPU_MACH_AMDGCN_GFX9_GENERIC`` 0x051 ``gfx9-generic``
1969 ``EF_AMDGPU_MACH_AMDGCN_GFX10_1_GENERIC`` 0x052 ``gfx10-1-generic``
1970 ``EF_AMDGPU_MACH_AMDGCN_GFX10_3_GENERIC`` 0x053 ``gfx10-3-generic``
1971 ``EF_AMDGPU_MACH_AMDGCN_GFX11_GENERIC`` 0x054 ``gfx11-generic``
1972 *reserved* 0x055 Reserved.
1973 ========================================== ========== =============================
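For illustration only, a partial C mapping from a few of the ``EF_AMDGPU_MACH``
values above to their processor names (a real consumer would cover the whole
table, or use the constants LLVM already provides):

.. code:: c

  /* Partial EF_AMDGPU_MACH value to processor name mapping; the values are
     taken from the table above and only a few representative entries are
     shown. */
  static const char *amdgpu_mach_name(unsigned mach) {
    switch (mach) {
    case 0x02c: return "gfx900";
    case 0x02f: return "gfx906";
    case 0x036: return "gfx1030";
    case 0x03f: return "gfx90a";
    case 0x041: return "gfx1100";
    case 0x051: return "gfx9-generic";
    default:    return "unknown";
    }
  }
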
1978 An AMDGPU target ELF code object has the standard ELF sections which include:
1980 .. table:: AMDGPU ELF Sections
1981 :name: amdgpu-elf-sections-table
1983 ================== ================ =================================
1984 Name Type Attributes
1985 ================== ================ =================================
1986 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1987 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1988 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1989 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1990 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1991 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1992 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1993 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1994 ``.note`` ``SHT_NOTE`` *none*
1995 ``.rela``\ *name* ``SHT_RELA`` *none*
1996 ``.rela.dyn`` ``SHT_RELA`` *none*
1997 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1998 ``.shstrtab`` ``SHT_STRTAB`` *none*
1999 ``.strtab`` ``SHT_STRTAB`` *none*
2000 ``.symtab`` ``SHT_SYMTAB`` *none*
2001 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
2002 ================== ================ =================================
These sections have their standard meanings (see [ELF]_) and are only generated
if needed.
``.debug_``\ *\**
The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
2009 information on the DWARF produced by the AMDGPU backend.
2011 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
2012 The standard sections used by a dynamic loader.
``.note``
See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
backend.
2018 ``.rela``\ *name*, ``.rela.dyn``
2019 For relocatable code objects, *name* is the name of the section that the
2020 relocation records apply. For example, ``.rela.text`` is the section name for
2021 relocation records associated with the ``.text`` section.
2023 For linked shared code objects, ``.rela.dyn`` contains all the relocation
2024 records from each of the relocatable code object's ``.rela``\ *name* sections.
See :ref:`amdgpu-relocation-records` for the relocation records supported by
the AMDGPU backend.
``.text``
2030 The executable machine code for the kernels and functions they call. Generated
2031 as position independent code. See :ref:`amdgpu-code-conventions` for
2032 information on conventions used in the isa generation.
2034 .. _amdgpu-note-records:
2039 The AMDGPU backend code object contains ELF note records in the ``.note``
2040 section. The set of generated notes and their semantics depend on the code
2041 object version; see :ref:`amdgpu-note-records-v2` and
2042 :ref:`amdgpu-note-records-v3-onwards`.
2044 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
2045 must be generated after the ``name`` field to ensure the ``desc`` field is 4
2046 byte aligned. In addition, minimal zero-byte padding must be generated to
2047 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 8 byte
alignment.
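The padding rules above can be applied mechanically when walking the ``.note``
section. The following C sketch assumes the standard ELF note header layout
(32-bit ``namesz``, ``descsz`` and ``type`` words, see [ELF]_) and a section
image that is suitably aligned in memory; ``walk_notes`` and ``align4`` are
illustrative helpers, not part of any AMDGPU API.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Standard ELF note header. */
  typedef struct {
    uint32_t n_namesz; /* Size of the name field, including the trailing NUL. */
    uint32_t n_descsz; /* Size of the desc field. */
    uint32_t n_type;   /* Note type, e.g. NT_AMDGPU_METADATA. */
  } ElfNoteHdr;

  static size_t align4(size_t x) { return (x + 3) & ~(size_t)3; }

  /* Visit every note in a .note section image of `size` bytes. */
  static void walk_notes(const unsigned char *sec, size_t size,
                         void (*visit)(const char *name, uint32_t type,
                                       const unsigned char *desc,
                                       uint32_t descsz)) {
    size_t off = 0;
    while (off + sizeof(ElfNoteHdr) <= size) {
      const ElfNoteHdr *hdr = (const ElfNoteHdr *)(sec + off);
      const char *name = (const char *)(sec + off + sizeof(ElfNoteHdr));
      /* The name is zero padded so that desc starts on a 4 byte boundary. */
      const unsigned char *desc =
          sec + off + sizeof(ElfNoteHdr) + align4(hdr->n_namesz);
      visit(name, hdr->n_type, desc, hdr->n_descsz);
      /* The desc size is also rounded up to a multiple of 4 bytes. */
      off += sizeof(ElfNoteHdr) + align4(hdr->n_namesz) + align4(hdr->n_descsz);
    }
  }
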
2051 .. _amdgpu-note-records-v2:
2053 Code Object V2 Note Records
2054 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2057 Code object V2 generation is no longer supported by this version of LLVM.
2059 The AMDGPU backend code object uses the following ELF note record in the
2060 ``.note`` section when compiling for code object V2.
2062 The note record vendor field is "AMD".
2064 Additional note records may be present, but any which are not documented here
2065 are deprecated and should not be used.
2067 .. table:: AMDGPU Code Object V2 ELF Note Records
2068 :name: amdgpu-elf-note-records-v2-table
2070 ===== ===================================== ======================================
2071 Name Type Description
2072 ===== ===================================== ======================================
2073 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
2074 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
2075 Finalizer and not the LLVM compiler.
2076 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
2077 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
2078 YAML [YAML]_ textual format.
2079 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
2080 ===== ===================================== ======================================
2084 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
2085 :name: amdgpu-elf-note-record-enumeration-values-v2-table
===================================== =====
Name                                  Value
===================================== =====
2090 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
2091 ``NT_AMD_HSA_HSAIL`` 2
2092 ``NT_AMD_HSA_ISA_VERSION`` 3
2094 ``NT_AMD_HSA_METADATA`` 10
2095 ``NT_AMD_HSA_ISA_NAME`` 11
2096 ===================================== =====
2098 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
Specifies the code object version number. The description field has the
following layout:
struct amdgpu_hsa_note_code_object_version_s {
uint32_t major_version;
uint32_t minor_version;
};
2109 The ``major_version`` has a value less than or equal to 2.
2111 ``NT_AMD_HSA_HSAIL``
2112 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
2113 field has the following layout:
2117 struct amdgpu_hsa_note_hsail_s {
2118 uint32_t hsail_major_version;
2119 uint32_t hsail_minor_version;
2121 uint8_t machine_model;
2122 uint8_t default_float_round;
2125 ``NT_AMD_HSA_ISA_VERSION``
2126 Specifies the target ISA version. The description field has the following layout:
2130 struct amdgpu_hsa_note_isa_s {
2131 uint16_t vendor_name_size;
2132 uint16_t architecture_name_size;
2136 char vendor_and_architecture_name[1];
2139 ``vendor_name_size`` and ``architecture_name_size`` are the length of the
2140 vendor and architecture names respectively, including the NUL character.
``vendor_and_architecture_name`` contains the NUL terminated string for the
vendor, immediately followed by the NUL terminated string for the
architecture.
2146 This note record is used by the HSA runtime loader.
2148 Code object V2 only supports a limited number of processors and has fixed
2149 settings for target features. See
2150 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
2151 processors and the corresponding target ID. In the table the note record ISA
2152 name is a concatenation of the vendor name, architecture name, major, minor,
2153 and stepping separated by a ":".
2155 The target ID column shows the processor name and fixed target features used
2156 by the LLVM compiler. The LLVM compiler does not generate a
2157 ``NT_AMD_HSA_HSAIL`` note record.
2159 A code object generated by the Finalizer also uses code object V2 and always
2160 generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
2161 ``sramecc`` target feature is as shown in
2162 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
bit.
2166 ``NT_AMD_HSA_ISA_NAME``
2167 Specifies the target ISA name as a non-NUL terminated string.
2169 This note record is not used by the HSA runtime loader.
2171 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
2172 V2's limited support of processors and fixed settings for target features.
2174 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
2175 from the string to the corresponding target ID. If the ``xnack`` target
2176 feature is supported and enabled, the string produced by the LLVM compiler
will have a ``+xnack`` appended. The Finalizer did not do the appending and
2178 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
2180 ``NT_AMD_HSA_METADATA``
2181 Specifies extensible metadata associated with the code objects executed on HSA
2182 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
2183 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
:ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
metadata string.
2187 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
2188 :name: amdgpu-elf-note-record-supported_processors-v2-table
2190 ===================== ==========================
2191 Note Record ISA Name Target ID
2192 ===================== ==========================
2193 ``AMD:AMDGPU:6:0:0`` ``gfx600``
2194 ``AMD:AMDGPU:6:0:1`` ``gfx601``
2195 ``AMD:AMDGPU:6:0:2`` ``gfx602``
2196 ``AMD:AMDGPU:7:0:0`` ``gfx700``
2197 ``AMD:AMDGPU:7:0:1`` ``gfx701``
2198 ``AMD:AMDGPU:7:0:2`` ``gfx702``
2199 ``AMD:AMDGPU:7:0:3`` ``gfx703``
2200 ``AMD:AMDGPU:7:0:4`` ``gfx704``
2201 ``AMD:AMDGPU:7:0:5`` ``gfx705``
2202 ``AMD:AMDGPU:8:0:0`` ``gfx802``
2203 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
2204 ``AMD:AMDGPU:8:0:2`` ``gfx802``
2205 ``AMD:AMDGPU:8:0:3`` ``gfx803``
2206 ``AMD:AMDGPU:8:0:4`` ``gfx803``
2207 ``AMD:AMDGPU:8:0:5`` ``gfx805``
2208 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
2209 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
2210 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
2211 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
2212 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
2213 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
2214 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
2215 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
2216 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
2217 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
2218 ===================== ==========================
2220 .. _amdgpu-note-records-v3-onwards:
2222 Code Object V3 and Above Note Records
2223 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2225 The AMDGPU backend code object uses the following ELF note record in the
2226 ``.note`` section when compiling for code object V3 and above.
2228 The note record vendor field is "AMDGPU".
2230 Additional note records may be present, but any which are not documented here
2231 are deprecated and should not be used.
2233 .. table:: AMDGPU Code Object V3 and Above ELF Note Records
2234 :name: amdgpu-elf-note-records-table-v3-onwards
2236 ======== ============================== ======================================
2237 Name Type Description
2238 ======== ============================== ======================================
2239 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
2241 ======== ============================== ======================================
2245 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
2246 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
============================== =====
Name                           Value
============================== =====
2252 ``NT_AMDGPU_METADATA`` 32
2253 ============================== =====
2255 ``NT_AMDGPU_METADATA``
2256 Specifies extensible metadata associated with an AMDGPU code object. It is
2257 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
2258 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2259 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
code object metadata.
2268 Symbols include the following:
2270 .. table:: AMDGPU ELF Symbols
2271 :name: amdgpu-elf-symbols-table
2273 ===================== ================== ================ ==================
2274 Name Type Section Description
2275 ===================== ================== ================ ==================
2276 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
2279 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
2280 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
2281 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
2282 ===================== ================== ================ ==================
2285 Global variables both used and defined by the compilation unit.
2287 If the symbol is defined in the compilation unit then it is allocated in the
appropriate section according to whether it has initialized data or is readonly.
2290 If the symbol is external then its section is ``STN_UNDEF`` and the loader
2291 will resolve relocations using the definition provided by another code object
2292 or explicitly defined by the runtime.
2294 If the symbol resides in local/group memory (LDS) then its section is the
2295 special processor specific section name ``SHN_AMDGPU_LDS``, and the
``st_value`` field describes alignment requirements as it does for common
symbols.
2301 Add description of linked shared object symbols. Seems undefined symbols
2302 are marked as STT_NOTYPE.
2305 Every HSA kernel has an associated kernel descriptor. It is the address of the
2306 kernel descriptor that is used in the AQL dispatch packet used to invoke the
2307 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
2308 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
2311 Every HSA kernel also has a symbol for its machine code entry point.
2313 .. _amdgpu-relocation-records:
2318 The AMDGPU backend generates ``Elf64_Rela`` relocation records for
2319 AMDHSA or ``Elf64_Rel`` relocation records for Mesa/AMDPAL. Supported
2320 relocatable fields are:
``word32``
This specifies a 32-bit field occupying 4 bytes with arbitrary byte
2324 alignment. These values use the same byte order as other word values in the
2325 AMDGPU architecture.
``word64``
This specifies a 64-bit field occupying 8 bytes with arbitrary byte
2329 alignment. These values use the same byte order as other word values in the
2330 AMDGPU architecture.
The following notations are used for specifying relocation calculations:
``A``
Represents the addend used to compute the value of the relocatable field. If
2336 the addend field is smaller than 64 bits then it is zero-extended to 64 bits
2337 for use in the calculations below. (In practice this only affects ``_HI``
2338 relocation types on Mesa/AMDPAL, where the addend comes from the 32-bit field
but the result of the calculation depends on the high part of the full 64-bit
address.)
``G``
Represents the offset into the global offset table at which the relocation
2344 entry's symbol will reside during execution.
``GOT``
Represents the address of the global offset table.
``P``
Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
2351 of the storage unit being relocated (computed using ``r_offset``).
``S``
Represents the value of the symbol whose index resides in the relocation
entry. Relocations not using this must specify a symbol index of
``STN_UNDEF``.
``B``
Represents the base address of a loaded executable or shared object which is
2360 the difference between the ELF address and the actual load address.
2361 Relocations using this are only valid in executable or shared objects.
2363 The following relocation types are supported:
2365 .. table:: AMDGPU ELF Relocation Records
2366 :name: amdgpu-elf-relocation-records-table
2368 ========================== ======= ===== ========== ==============================
2369 Relocation Type Kind Value Field Calculation
2370 ========================== ======= ===== ========== ==============================
2371 ``R_AMDGPU_NONE`` 0 *none* *none*
``R_AMDGPU_ABS32_LO`` Static, Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
``R_AMDGPU_ABS32_HI`` Static, Dynamic 2 ``word32`` (S + A) >> 32
``R_AMDGPU_ABS64`` Static, Dynamic 3 ``word64`` S + A
2378 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
2379 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
``R_AMDGPU_ABS32`` Static, Dynamic 6 ``word32`` S + A
2382 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
2383 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
2384 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
2385 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
2386 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
2388 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
2389 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
2390 ========================== ======= ===== ========== ==============================
2392 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
2393 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
2395 There is no current OS loader support for 32-bit programs and so
2396 ``R_AMDGPU_ABS32`` is not used.
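To illustrate how the split ``_LO``/``_HI`` calculations in the table compose,
the following C sketch computes the two 32-bit values for an
``R_AMDGPU_REL32_LO``/``R_AMDGPU_REL32_HI`` pair from the ``S``, ``A`` and
``P`` terms defined above (the function names are illustrative only):

.. code:: c

  #include <stdint.h>

  /* R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF */
  static uint32_t rel32_lo(uint64_t S, uint64_t A, uint64_t P) {
    return (uint32_t)((S + A - P) & 0xFFFFFFFFu);
  }

  /* R_AMDGPU_REL32_HI: (S + A - P) >> 32 */
  static uint32_t rel32_hi(uint64_t S, uint64_t A, uint64_t P) {
    return (uint32_t)((S + A - P) >> 32);
  }
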
2398 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
2400 Loaded Code Object Path Uniform Resource Identifier (URI)
2401 ---------------------------------------------------------
2403 The AMD GPU code object loader represents the path of the ELF shared object from
2404 which the code object was loaded as a textual Uniform Resource Identifier (URI).
2405 Note that the code object is the in memory loaded relocated form of the ELF
2406 shared object. Multiple code objects may be loaded at different memory
2407 addresses in the same process from the same ELF shared object.
2409 The loaded code object path URI syntax is defined by the following BNF syntax:
2413 code_object_uri ::== file_uri | memory_uri
2414 file_uri ::== "file://" file_path [ range_specifier ]
2415 memory_uri ::== "memory://" process_id range_specifier
2416 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
2417 file_path ::== URI_ENCODED_OS_FILE_PATH
2418 process_id ::== DECIMAL_NUMBER
2419 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
``number``
Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
2423 and octal values by "0".
``file_path``
Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
encoded as two uppercase hexadecimal digits preceded by "%". Directories in
the path are separated by "/".
``offset``
Is a 0-based byte offset to the start of the code object. For a file URI, it
2433 is from the start of the file specified by the ``file_path``, and if omitted
2434 defaults to 0. For a memory URI, it is the memory address and is required.
``size``
Is the number of bytes in the code object. For a file URI, if omitted it
2438 defaults to the size of the file. It is required for a memory URI.
``process_id``
Is the identity of the process owning the memory. For Linux it is the C
2442 unsigned integral decimal literal for the process ID (PID).
2448 file:///dir1/dir2/file1
2449 file:///dir3/dir4/file2#offset=0x2000&size=3000
2450 memory://1234#offset=0x20000&size=3000
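A minimal C sketch of pulling the ``offset`` and ``size`` fields out of a
loaded code object URI such as the examples above; ``parse_range`` is an
illustrative helper and does no validation beyond locating the two fields.
``strtoull`` with base 0 accepts the hexadecimal, decimal and octal forms
allowed for ``number``.

.. code:: c

  #include <stdlib.h>
  #include <string.h>

  /* Extract "offset=" and "size=" from a range_specifier, if present.
     Returns 1 when both fields were found. */
  static int parse_range(const char *uri, unsigned long long *offset,
                         unsigned long long *size) {
    const char *off_s  = strstr(uri, "offset=");
    const char *size_s = strstr(uri, "size=");
    if (!off_s || !size_s)
      return 0;
    /* Base 0 handles the "0x"/"0X" hexadecimal, "0" octal and decimal forms. */
    *offset = strtoull(off_s + strlen("offset="), NULL, 0);
    *size   = strtoull(size_s + strlen("size="), NULL, 0);
    return 1;
  }
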
2452 .. _amdgpu-dwarf-debug-information:
2454 DWARF Debug Information
2455 =======================
2459 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
2460 is not currently fully implemented and is subject to change.
2462 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
2463 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
2464 object executable code and data to the source language constructs. It can be
2465 used by tools such as debuggers and profilers. It uses features defined in
2466 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
2467 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
2469 This section defines the AMDGPU target architecture specific DWARF mappings.
2471 .. _amdgpu-dwarf-register-identifier:
2476 This section defines the AMDGPU target architecture register numbers used in
2477 DWARF operation expressions (see DWARF Version 5 section 2.5 and
2478 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
2479 instructions (see DWARF Version 5 section 6.4 and
2480 :ref:`amdgpu-dwarf-call-frame-information`).
2482 A single code object can contain code for kernels that have different wavefront
2483 sizes. The vector registers and some scalar registers are based on the wavefront
2484 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
2485 simplifies the consumer of the DWARF so that each register has a fixed size,
2486 rather than being dynamic according to the wavefront size mode. Similarly,
2487 distinct DWARF registers are defined for those registers that vary in size
2488 according to the process address size. This allows a consumer to treat a
2489 specific AMDGPU processor as a single architecture regardless of how it is
2490 configured at run time. The compiler explicitly specifies the DWARF registers
2491 that match the mode in which the code it is generating will be executed.
2493 DWARF registers are encoded as numbers, which are mapped to architecture
2494 registers. The mapping for AMDGPU is defined in
2495 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
2498 .. table:: AMDGPU DWARF Register Mapping
2499 :name: amdgpu-dwarf-register-mapping-table
2501 ============== ================= ======== ==================================
2502 DWARF Register AMDGPU Register Bit Size Description
2503 ============== ================= ======== ==================================
2504 0 PC_32 32 Program Counter (PC) when
2505 executing in a 32-bit process
2506 address space. Used in the CFI to
describe the PC of the calling frame.
2509 1 EXEC_MASK_32 32 Execution Mask Register when
2510 executing in wavefront 32 mode.
2511 2-15 *Reserved* *Reserved for highly accessed
2512 registers using DWARF shortcut.*
2513 16 PC_64 64 Program Counter (PC) when
2514 executing in a 64-bit process
2515 address space. Used in the CFI to
describe the PC of the calling frame.
2518 17 EXEC_MASK_64 64 Execution Mask Register when
2519 executing in wavefront 64 mode.
2520 18-31 *Reserved* *Reserved for highly accessed
2521 registers using DWARF shortcut.*
32-95 SGPR0-SGPR63 32 Scalar General Purpose Registers.
2524 96-127 *Reserved* *Reserved for frequently accessed
2525 registers using DWARF 1-byte ULEB.*
2526 128 STATUS 32 Status Register.
2527 129-511 *Reserved* *Reserved for future Scalar
2528 Architectural Registers.*
2529 512 VCC_32 32 Vector Condition Code Register
when executing in wavefront 32 mode.
2532 513-767 *Reserved* *Reserved for future Vector
2533 Architectural Registers when
2534 executing in wavefront 32 mode.*
2535 768 VCC_64 64 Vector Condition Code Register
when executing in wavefront 64 mode.
2538 769-1023 *Reserved* *Reserved for future Vector
2539 Architectural Registers when
2540 executing in wavefront 64 mode.*
2541 1024-1087 *Reserved* *Reserved for padding.*
2542 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers.
2543 1130-1535 *Reserved* *Reserved for future Scalar
2544 General Purpose Registers.*
2545 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers
when executing in wavefront 32 mode.
2548 1792-2047 *Reserved* *Reserved for future Vector
2549 General Purpose Registers when
2550 executing in wavefront 32 mode.*
2551 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers
when executing in wavefront 32 mode.
2554 2304-2559 *Reserved* *Reserved for future Vector
2555 Accumulation Registers when
2556 executing in wavefront 32 mode.*
2557 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers
when executing in wavefront 64 mode.
2560 2816-3071 *Reserved* *Reserved for future Vector
2561 General Purpose Registers when
2562 executing in wavefront 64 mode.*
2563 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers
when executing in wavefront 64 mode.
2566 3328-3583 *Reserved* *Reserved for future Vector
2567 Accumulation Registers when
2568 executing in wavefront 64 mode.*
2569 ============== ================= ======== ==================================
2571 The vector registers are represented as the full size for the wavefront. They
2572 are organized as consecutive dwords (32-bits), one per lane, with the dword at
2573 the least significant bit position corresponding to lane 0 and so forth. DWARF
2574 location expressions involving the ``DW_OP_LLVM_offset`` and
2575 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
2576 register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
2580 If the wavefront size is 32 lanes then the wavefront 32 mode register
2581 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
2582 mode register definitions are used. Some AMDGPU targets support executing in
2583 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
2584 to the wavefront mode of the generated code will be used.
2586 If code is generated to execute in a 32-bit process address space, then the
2587 32-bit process address space register definitions are used. If code is generated
2588 to execute in a 64-bit process address space, then the 64-bit process address
2589 space register definitions are used. The ``amdgcn`` target only supports the
2590 64-bit process address space.
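As a worked example of the register mapping table above, the following C
sketch computes DWARF register numbers for scalar and vector registers; the
base values are copied from the table and the function names are illustrative
only.

.. code:: c

  /* DWARF register number bases taken from the register mapping table above. */
  enum {
    DWARF_SGPR0_BASE        = 32,   /* SGPR0-SGPR63                     */
    DWARF_SGPR64_BASE       = 1088, /* SGPR64-SGPR105                   */
    DWARF_VGPR0_WAVE32_BASE = 1536, /* VGPR0-VGPR255, wavefront 32 mode */
    DWARF_AGPR0_WAVE32_BASE = 2048, /* AGPR0-AGPR255, wavefront 32 mode */
    DWARF_VGPR0_WAVE64_BASE = 2560, /* VGPR0-VGPR255, wavefront 64 mode */
    DWARF_AGPR0_WAVE64_BASE = 3072  /* AGPR0-AGPR255, wavefront 64 mode */
  };

  /* DWARF register number for SGPRn (n in 0..105), or -1 if out of range. */
  static int dwarf_sgpr(unsigned n) {
    if (n <= 63)
      return DWARF_SGPR0_BASE + (int)n;
    if (n <= 105)
      return DWARF_SGPR64_BASE + (int)(n - 64);
    return -1;
  }

  /* DWARF register number for VGPRn (n in 0..255) in the given wavefront
     mode (32 or 64), or -1 if out of range. */
  static int dwarf_vgpr(unsigned n, unsigned wavefront_size) {
    if (n > 255)
      return -1;
    if (wavefront_size == 32)
      return DWARF_VGPR0_WAVE32_BASE + (int)n;
    if (wavefront_size == 64)
      return DWARF_VGPR0_WAVE64_BASE + (int)n;
    return -1;
  }
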
2592 .. _amdgpu-dwarf-memory-space-identifier:
2594 Memory Space Identifier
2595 -----------------------
2597 The DWARF memory space represents the source language memory space. See DWARF
2598 Version 5 section 2.12 which is updated by the *DWARF Extensions For
2599 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
2601 The DWARF memory space mapping used for AMDGPU is defined in
2602 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
2604 .. table:: AMDGPU DWARF Memory Space Mapping
2605 :name: amdgpu-dwarf-memory-space-mapping-table
2607 =========================== ====== =================
2609 ---------------------------------- -----------------
2610 Memory Space Name Value Memory Space
2611 =========================== ====== =================
2612 ``DW_MSPACE_LLVM_none`` 0x0000 Generic (Flat)
2613 ``DW_MSPACE_LLVM_global`` 0x0001 Global
2614 ``DW_MSPACE_LLVM_constant`` 0x0002 Global
2615 ``DW_MSPACE_LLVM_group`` 0x0003 Local (group/LDS)
2616 ``DW_MSPACE_LLVM_private`` 0x0004 Private (Scratch)
2617 ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
2618 =========================== ====== =================
2620 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
2621 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. This is
2624 available for use for the AMD extension for access to the hardware GDS memory
2625 which is scratchpad memory allocated per device.
2627 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
2628 default memory space of ``DW_MSPACE_LLVM_none`` is used.
2630 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF memory spaces to DWARF address spaces, including address size
and NULL value.
2634 .. _amdgpu-dwarf-address-space-identifier:
2636 Address Space Identifier
2637 ------------------------
2639 DWARF address spaces correspond to target architecture specific linear
2640 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
2641 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
2643 The DWARF address space mapping used for AMDGPU is defined in
2644 :ref:`amdgpu-dwarf-address-space-mapping-table`.
2646 .. table:: AMDGPU DWARF Address Space Mapping
2647 :name: amdgpu-dwarf-address-space-mapping-table
2649 ======================================= ===== ======= ======== ===================== =======================
2651 --------------------------------------- ----- ---------------- --------------------- -----------------------
2652 Address Space Name Value Address Bit Size LLVM IR Address Space
2653 --------------------------------------- ----- ------- -------- --------------------- -----------------------
2658 ======================================= ===== ======= ======== ===================== =======================
2659 ``DW_ASPACE_LLVM_none`` 0x00 64 32 Global *default address space*
2660 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
2661 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
2662 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
2664 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
2665 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
2666 ======================================= ===== ======= ======== ===================== =======================
2668 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
2669 spaces including address size and NULL value.
2671 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
2672 address space used in DWARF operations that do not specify an address space. It
2673 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
2674 related operations can refer to addresses in the program code.
2676 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
2677 specify the flat address space. If the address corresponds to an address in the
2678 local address space, then it corresponds to the wavefront that is executing the
2679 focused thread of execution. If the address corresponds to an address in the
2680 private address space, then it corresponds to the lane that is executing the
2681 focused thread of execution for languages that are implemented using a SIMD or
2682 SIMT execution model.
2686 CUDA-like languages such as HIP that do not have address spaces in the
2687 language type system, but do allow variables to be allocated in different
2688 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2689 address space in the DWARF expression operations as the default address space
2690 is the global address space.
2692 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2693 specify the local address space corresponding to the wavefront that is executing
2694 the focused thread of execution.
2696 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2697 to specify the private address space corresponding to the lane that is executing
2698 the focused thread of execution for languages that are implemented using a SIMD
2699 or SIMT execution model.
2701 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2702 to specify the unswizzled private address space corresponding to the wavefront
2703 that is executing the focused thread of execution. The wavefront view of private
2704 memory is the per wavefront unswizzled backing memory layout defined in
2705 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2706 location for the backing memory of the wavefront (namely the address is not
2707 offset by ``wavefront-scratch-base``). The following formula can be used to
2708 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2709 ``DW_ASPACE_AMDGPU_private_wave`` address:
2713 private-address-wavefront =
2714 ((private-address-lane / 4) * wavefront-size * 4) +
2715 (wavefront-lane-id * 4) + (private-address-lane % 4)
If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then the
following simpler formula can be used:
2723 private-address-wavefront =
2724 private-address-lane * wavefront-size
2726 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2727 complete spilled vector register back into a complete vector register in the
2728 CFI. The frame pointer can be a private lane address which is dword aligned,
2729 which can be shifted to multiply by the wavefront size, and then used to form a
2730 private wavefront address that gives a location for a contiguous set of dwords,
2731 one per lane, where the vector register dwords are spilled. The compiler knows
2732 the wavefront size since it generates the code. Note that the type of the
2733 address may have to be converted as the size of a
2734 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2735 ``DW_ASPACE_AMDGPU_private_wave`` address.
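The two formulas above translate directly into code. The following is a
minimal C sketch (the helper names are illustrative; the wider result type
reflects the remark that a ``DW_ASPACE_AMDGPU_private_wave`` address may be
larger than a ``DW_ASPACE_AMDGPU_private_lane`` address):

.. code:: c

  #include <stdint.h>

  /* General conversion from a DW_ASPACE_AMDGPU_private_lane address to the
     corresponding DW_ASPACE_AMDGPU_private_wave address. */
  static uint64_t private_lane_to_wave(uint32_t private_address_lane,
                                       uint32_t wavefront_lane_id,
                                       uint32_t wavefront_size) {
    return ((uint64_t)(private_address_lane / 4) * wavefront_size * 4) +
           ((uint64_t)wavefront_lane_id * 4) + (private_address_lane % 4);
  }

  /* Simpler form for a dword aligned lane address when the start of the
     dwords for each lane (beginning with lane 0) is required. */
  static uint64_t private_lane_to_wave_dword(uint32_t private_address_lane,
                                             uint32_t wavefront_size) {
    return (uint64_t)private_address_lane * wavefront_size;
  }
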
2737 .. _amdgpu-dwarf-lane-identifier:
DWARF lane identifiers specify a target architecture lane position for hardware
2743 that executes in a SIMD or SIMT manner, and on which a source language maps its
2744 threads of execution onto those lanes. The DWARF lane identifier is pushed by
2745 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2746 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2747 section :ref:`amdgpu-dwarf-operation-expressions`.
2749 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2750 wavefront. It is numbered from 0 to the wavefront size minus 1.
2752 Operation Expressions
2753 ---------------------
2755 DWARF expressions are used to compute program values and the locations of
2756 program objects. See DWARF Version 5 section 2.5 and
2757 :ref:`amdgpu-dwarf-operation-expressions`.
2759 DWARF location descriptions describe how to access storage which includes memory
2760 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2761 significant bytes first, and bits are ordered within bytes with least
2762 significant bits first.
2764 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2765 unwinding vector registers that are spilled under the execution mask to memory:
2766 the zero-single location description is the vector register, and the one-single
2767 location description is the spilled memory location description. The
2768 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2769 memory location description.
2771 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2772 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2773 controlled by the execution mask. An undefined location description together
2774 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2775 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2777 Debugger Information Entry Attributes
2778 -------------------------------------
2780 This section describes how certain debugger information entry attributes are
2781 used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
2782 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2783 :ref:`amdgpu-dwarf-low-level-information` and
2784 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2786 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2788 ``DW_AT_LLVM_lane_pc``
2789 ~~~~~~~~~~~~~~~~~~~~~~
2791 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2792 location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.
2797 If the lane is inactive, but was active on entry to the subprogram, then this is
2798 the program location in the subprogram at which execution of the lane is
conceptually positioned.
2801 If the lane was not active on entry to the subprogram, then this will be the
2802 undefined location. A client debugger can check if the lane is part of a valid
2803 work-group by checking that the lane is in the range of the associated
2804 work-group within the grid, accounting for partial work-groups. If it is not,
2805 then the debugger can omit any information for the lane. Otherwise, the debugger
2806 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2807 calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
2809 ``DW_AT_LLVM_lane_pc``.
2811 The following example illustrates how the AMDGPU backend can generate a DWARF
2812 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2813 following subprogram pseudo code for a target with 64 lanes per wavefront.
2835 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2836 execution mask (``EXEC``) to linearize the control flow. The condition is
2837 evaluated to make a mask of the lanes for which the condition evaluates to true.
2838 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2839 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
2840 ``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of
2841 the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE``
2842 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2843 region. This is shown below. Other approaches are possible, but the basic
2844 concept is the same.
2877 To create the DWARF location list expression that defines the location
2878 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2879 pseudo instruction can be used to annotate the linearized control flow. This can
2880 be done by defining an artificial variable for the lane PC. The DWARF location
2881 list expression created for it is used as the value of the
2882 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2884 A DWARF procedure is defined for each well nested structured control flow region
2885 which provides the conceptual lane program location for a lane if it is not
2886 active (namely it is divergent). The DWARF operation expression for each region
2887 conceptually inherits the value of the immediately enclosing region and modifies
2888 it according to the semantics of the region.
2890 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2891 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2892 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2893 region since the ``THEN`` region has completed.
2895 The lane PC artificial variable is assigned at each region transition. It uses
2896 the immediately enclosing region's DWARF procedure to compute the program
2897 location for each lane assuming they are divergent, and then modifies the result
2898 by inserting the current program location for each lane that the ``EXEC`` mask
2899 indicates is active.
2901 By having separate DWARF procedures for each region, they can be reused to
2902 define the value for any nested region. This reduces the total size of the DWARF
2903 operation expressions.
2905 The following provides an example using pseudo LLVM MIR.
2911 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2912 DW_AT_name = "__uint64";
2913 DW_AT_byte_size = 8;
2914 DW_AT_encoding = DW_ATE_unsigned;
2916 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2917 DW_AT_name = "__active_lane_pc";
2920 DW_OP_LLVM_extend 64, 64;
DW_OP_regval_type EXEC, %__uint_64;
2922 DW_OP_LLVM_select_bit_piece 64, 64;
2925 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2926 DW_AT_name = "__divergent_lane_pc";
2928 DW_OP_LLVM_undefined;
2929 DW_OP_LLVM_extend 64, 64;
2932 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2933 DW_OP_call_ref %__divergent_lane_pc;
2934 DW_OP_call_ref %__active_lane_pc;
2938 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2943 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2944 DW_AT_name = "__divergent_lane_pc_1_then";
2945 DW_AT_location = DIExpression[
2946 DW_OP_call_ref %__divergent_lane_pc;
2947 DW_OP_addrx &lex_1_start;
2949 DW_OP_LLVM_extend 64, 64;
2950 DW_OP_call_ref %__lex_1_save_exec;
2951 DW_OP_deref_type 64, %__uint_64;
2952 DW_OP_LLVM_select_bit_piece 64, 64;
2955 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2956 DW_OP_call_ref %__divergent_lane_pc_1_then;
2957 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2966 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2967 DW_AT_name = "__divergent_lane_pc_1_1_then";
2968 DW_AT_location = DIExpression[
2969 DW_OP_call_ref %__divergent_lane_pc_1_then;
2970 DW_OP_addrx &lex_1_1_start;
2972 DW_OP_LLVM_extend 64, 64;
2973 DW_OP_call_ref %__lex_1_1_save_exec;
2974 DW_OP_deref_type 64, %__uint_64;
2975 DW_OP_LLVM_select_bit_piece 64, 64;
2978 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2979 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2980 DW_OP_call_ref %__active_lane_pc;
2985 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2986 DW_AT_name = "__divergent_lane_pc_1_1_else";
2987 DW_AT_location = DIExpression[
2988 DW_OP_call_ref %__divergent_lane_pc_1_then;
2989 DW_OP_addrx &lex_1_1_end;
2991 DW_OP_LLVM_extend 64, 64;
2992 DW_OP_call_ref %__lex_1_1_save_exec;
2993 DW_OP_deref_type 64, %__uint_64;
2994 DW_OP_LLVM_select_bit_piece 64, 64;
2997 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2998 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2999 DW_OP_call_ref %__active_lane_pc;
3004 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3005 DW_OP_call_ref %__divergent_lane_pc;
3006 DW_OP_call_ref %__active_lane_pc;
3011 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
3012 DW_AT_name = "__divergent_lane_pc_1_else";
3013 DW_AT_location = DIExpression[
3014 DW_OP_call_ref %__divergent_lane_pc;
3015 DW_OP_addrx &lex_1_end;
3017 DW_OP_LLVM_extend 64, 64;
3018 DW_OP_call_ref %__lex_1_save_exec;
3019 DW_OP_deref_type 64, %__uint_64;
3020 DW_OP_LLVM_select_bit_piece 64, 64;
3023 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3024 DW_OP_call_ref %__divergent_lane_pc_1_else;
3025 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
3031 DW_OP_call_ref %__divergent_lane_pc;
3032 DW_OP_call_ref %__active_lane_pc;
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC
elements that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.
3046 The DWARF procedures for each region use the values of the saved execution mask
3047 artificial variables to only update the lanes that are active on entry to the
3048 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
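Conceptually, each region's DWARF procedure performs a select over the vector
of lane program locations. A minimal Python model of that selection follows; it
assumes the saved execution mask is a simple integer bit mask and is an
illustration of the ``DW_OP_LLVM_select_bit_piece`` semantics used above, not a
DWARF expression evaluator.

.. code-block:: python

   def select_lane_pcs(divergent_pc, enclosing_pcs, saved_exec_mask):
       """Lanes whose bit is set in the execution mask saved on entry to the
       region take the region's divergent PC; all other lanes keep the value
       inherited from the enclosing region (which may be undefined)."""
       return [divergent_pc if (saved_exec_mask >> lane) & 1 else inherited
               for lane, inherited in enumerate(enclosing_pcs)]

   # 4-lane illustration: lanes 0 and 2 were active on entry to the region,
   # lane 1 was last active in the enclosing region, and lane 3 never was
   # (None stands in for the undefined location).
   print(select_lane_pcs(0x120, [None, 0x80, None, None], 0b0101))
   # -> [288, 128, 288, None]
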
3052 Other structured control flow regions can be handled similarly. For example,
3053 loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.

3057 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
3058 ``IF/THEN/ELSE`` regions.
3060 The DWARF procedures can use the active lane artificial variable described in
3061 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
3062 ``EXEC`` mask in order to support whole or quad wavefront mode.
3064 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
3066 ``DW_AT_LLVM_active_lane``
3067 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3069 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

3073 The execution mask may be modified to implement whole or quad wavefront mode
3074 operations. For example, all lanes may need to temporarily be made active to
3075 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
3076 update it to enable the necessary lanes, perform the operations, and then
3077 restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.

3081 This is handled by defining an artificial variable for the active lane mask. The
3082 active lane mask artificial variable would be the actual ``EXEC`` mask for
3083 normal regions, and the saved execution mask for regions where the mask is
3084 temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

3088 ``DW_AT_LLVM_augmentation``
3089 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
3091 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
3092 debugger information entry has the following value for the augmentation string:
3098 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
3099 extensions used in the DWARF of the compilation unit. The version number
3100 conforms to [SEMVER]_.
3102 Call Frame Information
3103 ----------------------
3105 DWARF Call Frame Information (CFI) describes how a consumer can virtually
3106 *unwind* call frames in a running process or core dump. See DWARF Version 5
3107 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
3109 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
3111 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
3117 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
3118 extensions used in this CIE or to the FDEs that use it. The version number
3119 conforms to [SEMVER]_.
3121 2. ``address_size`` for the ``Global`` address space is defined in
3122 :ref:`amdgpu-dwarf-address-space-identifier`.
3124 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
3126 4. ``code_alignment_factor`` is 4 bytes.
3130 Add to :ref:`amdgpu-processor-table` table.
3132 5. ``data_alignment_factor`` is 4 bytes.
3136 Add to :ref:`amdgpu-processor-table` table.
3138 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
3139 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
3141 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
3142 called from subprogram Y that has more allocated, X will not change any of
3143 the extra registers as it cannot access them. Therefore, the default rule
3144 for all columns is ``same value``.
3146 For AMDGPU the register number follows the numbering defined in
3147 :ref:`amdgpu-dwarf-register-identifier`.
3149 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
3150 the return address to get the address of a byte within the call site
3151 instructions. See DWARF Version 5 section 6.4.4.
3156 See DWARF Version 5 section 6.1.
3158 Lookup By Name Section Header
3159 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3161 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
3163 For AMDGPU the lookup by name section header table:
3165 ``augmentation_string_size`` (uword)
Set to the length of the ``augmentation_string`` value, which is always a
multiple of 4.

3170 ``augmentation_string`` (sequence of UTF-8 characters)
3172 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
3178 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
3179 extensions used in the DWARF of this index. The version number conforms to
This is different from the DWARF Version 5 definition that requires the first
3185 4 characters to be the vendor ID. But this is consistent with the other
3186 augmentation strings and does allow multiple vendor contributions. However,
3187 backwards compatibility may be more desirable.
3189 Lookup By Address Section Header
3190 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3192 See DWARF Version 5 section 6.1.2.
3194 For AMDGPU the lookup by address section header table:
3196 ``address_size`` (ubyte)
3198 Match the address size for the ``Global`` address space defined in
3199 :ref:`amdgpu-dwarf-address-space-identifier`.
3201 ``segment_selector_size`` (ubyte)
3203 AMDGPU does not use a segment selector so this is 0. The entries in the
3204 ``.debug_aranges`` do not have a segment selector.
3206 Line Number Information
3207 -----------------------
3209 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
3212 The instruction set must be obtained from the ELF file header ``e_flags`` field
3213 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
3214 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
3218 Should the ``isa`` state machine register be used to indicate if the code is
3219 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
3221 For AMDGPU the line number program header fields have the following values (see
3222 DWARF Version 5 section 6.2.4):
3224 ``address_size`` (ubyte)
3225 Matches the address size for the ``Global`` address space defined in
3226 :ref:`amdgpu-dwarf-address-space-identifier`.
3228 ``segment_selector_size`` (ubyte)
3229 AMDGPU does not use a segment selector so this is 0.
3231 ``minimum_instruction_length`` (ubyte)
3232 For GFX9-GFX11 this is 4.
3234 ``maximum_operations_per_instruction`` (ubyte)
3235 For GFX9-GFX11 this is 1.
3237 Source text for online-compiled programs (for example, those compiled by the
3238 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
3239 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
3240 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
3241 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
3243 The Clang option used to control source embedding in AMDGPU is defined in
3244 :ref:`amdgpu-clang-debug-options-table`.
3246 .. table:: AMDGPU Clang Debug Options
3247 :name: amdgpu-clang-debug-options-table
3249 ==================== ==================================================
3250 Debug Flag Description
3251 ==================== ==================================================
3252 -g[no-]embed-source Enable/disable embedding source text in DWARF
3253 debug sections. Useful for environments where
3254 source cannot be written to disk, such as
3255 when performing online compilation.
3256 ==================== ==================================================
``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.
3266 32-Bit and 64-Bit DWARF Formats
3267 -------------------------------
3269 See DWARF Version 5 section 7.4 and
3270 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

3277 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
3278 the 32-bit DWARF format.
3283 For AMDGPU the following values apply for each of the unit headers described in
3284 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
3286 ``address_size`` (ubyte)
3287 Matches the address size for the ``Global`` address space defined in
3288 :ref:`amdgpu-dwarf-address-space-identifier`.
3290 .. _amdgpu-code-conventions:
3295 This section provides code conventions used for each supported target triple OS
3296 (see :ref:`amdgpu-target-triples`).
3301 This section provides code conventions used when the target triple OS is
3302 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
3304 .. _amdgpu-amdhsa-code-object-metadata:
3306 Code Object Metadata
3307 ~~~~~~~~~~~~~~~~~~~~
3309 The code object metadata specifies extensible metadata associated with the code
3310 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
3311 encoding and semantics of this metadata depends on the code object version; see
3312 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
3313 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
3314 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
3315 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
3317 Code object metadata is specified in a note record (see
3318 :ref:`amdgpu-note-records`) and is required when the target triple OS is
3319 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
3320 information necessary to support the HSA compatible runtime kernel queries. For
3321 example, the segment sizes needed in a dispatch packet. In addition, a
3322 high-level language runtime may require other information to be included. For
3323 example, the AMD OpenCL runtime records kernel argument information.
3325 .. _amdgpu-amdhsa-code-object-metadata-v2:
3327 Code Object V2 Metadata
3328 +++++++++++++++++++++++
3331 Code object V2 generation is no longer supported by this version of LLVM.
3333 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
3334 (see :ref:`amdgpu-note-records-v2`).
3336 The metadata is specified as a YAML formatted string (see [YAML]_ and
Is the string null terminated? It probably should not be if YAML allows it to
contain null characters; otherwise it should be.
3344 The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

3348 For boolean values, the string values of ``false`` and ``true`` are used for
3349 false and true respectively.
3351 Additional information can be added to the mappings. To avoid conflicts, any
3352 non-AMD key names should be prefixed by "*vendor-name*.".
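For illustration, the metadata can be inspected with any YAML parser once the
note payload has been extracted from the code object. The following minimal
sketch uses the third-party PyYAML package; the metadata string is abridged and
hypothetical.

.. code-block:: python

   import yaml  # third-party PyYAML package

   # Abridged, hypothetical NT_AMD_HSA_METADATA payload (extraction from the
   # ELF note is assumed to have been done already).
   text = """
   Version: [1, 0]
   Kernels:
     - Name:       my_kernel
       SymbolName: my_kernel
       CodeProps:
         KernargSegmentSize:      24
         GroupSegmentFixedSize:   0
         PrivateSegmentFixedSize: 0
         KernargSegmentAlign:     8
         WavefrontSize:           64
   """
   md = yaml.safe_load(text)
   print(md["Version"], [k["Name"] for k in md["Kernels"]])
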
3354 .. table:: AMDHSA Code Object V2 Metadata Map
3355 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
3357 ========== ============== ========= =======================================
3358 String Key Value Type Required? Description
3359 ========== ============== ========= =======================================
3360 "Version" sequence of Required - The first integer is the major
3361 2 integers version. Currently 1.
3362 - The second integer is the minor
3363 version. Currently 0.
3364 "Printf" sequence of Each string is encoded information
3365 strings about a printf function call. The
3366 encoded information is organized as
3367 fields separated by colon (':'):
3369 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3374 A 32-bit integer as a unique id for
3375 each printf function call
3378 A 32-bit integer equal to the number
3379 of arguments of printf function call
3382 ``S[i]`` (where i = 0, 1, ... , N-1)
3383 32-bit integers for the size in bytes
3384 of the i-th FormatString argument of
3385 the printf function call
3388 The format string passed to the
3389 printf function call.
3390 "Kernels" sequence of Required Sequence of the mappings for each
3391 mapping kernel in the code object. See
3392 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
3393 for the definition of the mapping.
3394 ========== ============== ========= =======================================
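As an illustration of the "Printf" encoding described above, the following
small sketch splits one such string into its fields; the example entry is
hypothetical.

.. code-block:: python

   def parse_printf_metadata(entry: str):
       """Split one "Printf" metadata string of the form
       ID:N:S[0]:...:S[N-1]:FormatString.  The trailing field is treated as
       the format string even if it contains ':' characters."""
       fields = entry.split(":")
       printf_id = int(fields[0])
       nargs = int(fields[1])
       sizes = [int(s) for s in fields[2:2 + nargs]]
       fmt = ":".join(fields[2 + nargs:])   # the rest is the format string
       return printf_id, nargs, sizes, fmt

   # Hypothetical entry describing printf("%d %s\n", i, s) with a 4-byte int
   # and an 8-byte pointer argument.
   print(parse_printf_metadata("1:2:4:8:%d %s\n"))
   # -> (1, 2, [4, 8], '%d %s\n')
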
3398 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
3399 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
3401 ================= ============== ========= ================================
3402 String Key Value Type Required? Description
3403 ================= ============== ========= ================================
3404 "Name" string Required Source name of the kernel.
3405 "SymbolName" string Required Name of the kernel
3406 descriptor ELF symbol.
3407 "Language" string Source language of the kernel.
3415 "LanguageVersion" sequence of - The first integer is the major
3417 - The second integer is the
3419 "Attrs" mapping Mapping of kernel attributes.
3421 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
3422 for the mapping definition.
3423 "Args" sequence of Sequence of mappings of the
3424 mapping kernel arguments. See
3425 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
3426 for the definition of the mapping.
3427 "CodeProps" mapping Mapping of properties related to
3428 the kernel code. See
3429 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
3430 for the mapping definition.
3431 ================= ============== ========= ================================
3435 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
3436 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
3438 =================== ============== ========= ==============================
3439 String Key Value Type Required? Description
3440 =================== ============== ========= ==============================
3441 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
3442 3 integers must be >=1 and the dispatch
3443 work-group size X, Y, Z must
3444 correspond to the specified
3445 values. Defaults to 0, 0, 0.
3447 Corresponds to the OpenCL
3448 ``reqd_work_group_size``
3450 "WorkGroupSizeHint" sequence of The dispatch work-group size
3451 3 integers X, Y, Z is likely to be the
3454 Corresponds to the OpenCL
3455 ``work_group_size_hint``
3457 "VecTypeHint" string The name of a scalar or vector
3460 Corresponds to the OpenCL
3461 ``vec_type_hint`` attribute.
3463 "RuntimeHandle" string The external symbol name
3464 associated with a kernel.
3465 OpenCL runtime allocates a
3466 global buffer for the symbol
3467 and saves the kernel's address
3468 to it, which is used for
3469 device side enqueueing. Only
3470 available for device side
3472 =================== ============== ========= ==============================
3476 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
3477 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
3479 ================= ============== ========= ================================
3480 String Key Value Type Required? Description
3481 ================= ============== ========= ================================
3482 "Name" string Kernel argument name.
3483 "TypeName" string Kernel argument type name.
3484 "Size" integer Required Kernel argument size in bytes.
3485 "Align" integer Required Kernel argument alignment in
3486 bytes. Must be a power of two.
3487 "ValueKind" string Required Kernel argument kind that
3488 specifies how to set up the
3489 corresponding argument.
3493 The argument is copied
3494 directly into the kernarg.
3497 A global address space pointer
3498 to the buffer data is passed
3501 "DynamicSharedPointer"
3502 A group address space pointer
3503 to dynamically allocated LDS
3504 is passed in the kernarg.
3507 A global address space
3508 pointer to a S# is passed in
3512 A global address space
3513 pointer to a T# is passed in
3517 A global address space pointer
3518 to an OpenCL pipe is passed in
3522 A global address space pointer
3523 to an OpenCL device enqueue
3524 queue is passed in the
3527 "HiddenGlobalOffsetX"
3528 The OpenCL grid dispatch
3529 global offset for the X
3530 dimension is passed in the
3533 "HiddenGlobalOffsetY"
3534 The OpenCL grid dispatch
3535 global offset for the Y
3536 dimension is passed in the
3539 "HiddenGlobalOffsetZ"
3540 The OpenCL grid dispatch
3541 global offset for the Z
3542 dimension is passed in the
3546 An argument that is not used
3547 by the kernel. Space needs to
3548 be left for it, but it does
3549 not need to be set up.
3551 "HiddenPrintfBuffer"
3552 A global address space pointer
3553 to the runtime printf buffer
3554 is passed in kernarg. Mutually
3556 "HiddenHostcallBuffer".
3558 "HiddenHostcallBuffer"
3559 A global address space pointer
3560 to the runtime hostcall buffer
3561 is passed in kernarg. Mutually
3563 "HiddenPrintfBuffer".
3565 "HiddenDefaultQueue"
3566 A global address space pointer
3567 to the OpenCL device enqueue
3568 queue that should be used by
3569 the kernel by default is
3570 passed in the kernarg.
3572 "HiddenCompletionAction"
3573 A global address space pointer
3574 to help link enqueued kernels into
3575 the ancestor tree for determining
3576 when the parent kernel has finished.
3578 "HiddenMultiGridSyncArg"
3579 A global address space pointer for
3580 multi-grid synchronization is
3581 passed in the kernarg.
3583 "ValueType" string Unused and deprecated. This should no longer
3584 be emitted, but is accepted for compatibility.
3587 "PointeeAlign" integer Alignment in bytes of pointee
3588 type for pointer type kernel
3589 argument. Must be a power
3590 of 2. Only present if
3592 "DynamicSharedPointer".
3593 "AddrSpaceQual" string Kernel argument address space
3594 qualifier. Only present if
3595 "ValueKind" is "GlobalBuffer" or
3596 "DynamicSharedPointer". Values
3608 Is GlobalBuffer only Global
3610 DynamicSharedPointer always
3611 Local? Can HCC allow Generic?
3612 How can Private or Region
3615 "AccQual" string Kernel argument access
3616 qualifier. Only present if
3617 "ValueKind" is "Image" or
3630 "ActualAccQual" string The actual memory accesses
3631 performed by the kernel on the
3632 kernel argument. Only present if
3633 "ValueKind" is "GlobalBuffer",
3634 "Image", or "Pipe". This may be
3635 more restrictive than indicated
3636 by "AccQual" to reflect what the
kernel actually does. If not
3638 present then the runtime must
3639 assume what is implied by
3640 "AccQual" and "IsConst". Values
3647 "IsConst" boolean Indicates if the kernel argument
3648 is const qualified. Only present
3652 "IsRestrict" boolean Indicates if the kernel argument
3653 is restrict qualified. Only
3654 present if "ValueKind" is
3657 "IsVolatile" boolean Indicates if the kernel argument
3658 is volatile qualified. Only
3659 present if "ValueKind" is
3662 "IsPipe" boolean Indicates if the kernel argument
3663 is pipe qualified. Only present
3664 if "ValueKind" is "Pipe".
3668 Can GlobalBuffer be pipe
3671 ================= ============== ========= ================================
3675 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
3676 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
3678 ============================ ============== ========= =====================
3679 String Key Value Type Required? Description
3680 ============================ ============== ========= =====================
3681 "KernargSegmentSize" integer Required The size in bytes of
3683 that holds the values
3686 "GroupSegmentFixedSize" integer Required The amount of group
3690 bytes. This does not
3692 dynamically allocated
3693 group segment memory
3697 "PrivateSegmentFixedSize" integer Required The amount of fixed
3698 private address space
3699 memory required for a
3701 bytes. If the kernel
3703 stack then additional
3705 to this value for the
3707 "KernargSegmentAlign" integer Required The maximum byte
3710 kernarg segment. Must
3712 "WavefrontSize" integer Required Wavefront size. Must
3714 "NumSGPRs" integer Required Number of scalar
3718 includes the special
3720 Scratch (GFX7-GFX10)
3722 GFX8-GFX10). It does
3724 SGPR added if a trap
3730 "NumVGPRs" integer Required Number of vector
3734 "MaxFlatWorkGroupSize" integer Required Maximum flat
3737 kernel in work-items.
3740 ReqdWorkGroupSize if
3742 "NumSpilledSGPRs" integer Number of stores from
3743 a scalar register to
3744 a register allocator
3747 "NumSpilledVGPRs" integer Number of stores from
3748 a vector register to
3749 a register allocator
3752 ============================ ============== ========= =====================
3754 .. _amdgpu-amdhsa-code-object-metadata-v3:
3756 Code Object V3 Metadata
3757 +++++++++++++++++++++++
3760 Code object V3 generation is no longer supported by this version of LLVM.
3762 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3763 record (see :ref:`amdgpu-note-records-v3-onwards`).
3765 The metadata is represented as Message Pack formatted binary data (see
3766 [MsgPack]_). The top level is a Message Pack map that includes the
3767 keys defined in table
3768 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3771 Additional information can be added to the maps. To avoid conflicts,
3772 any key names should be prefixed by "*vendor-name*." where
3773 ``vendor-name`` can be the name of the vendor and specific vendor
3774 tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the same
*vendor-name*.

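For illustration, once the ``NT_AMDGPU_METADATA`` note payload has been
extracted from the code object (for example with an ELF tool), it can be
decoded with any MessagePack library. The following minimal sketch uses the
third-party ``msgpack`` Python package; the file name is hypothetical.

.. code-block:: python

   import msgpack  # third-party msgpack package

   # "metadata.bin" is a hypothetical file holding the raw NT_AMDGPU_METADATA
   # note payload, already extracted from the code object.
   with open("metadata.bin", "rb") as f:
       md = msgpack.unpackb(f.read())

   print(md["amdhsa.version"])
   for kernel in md["amdhsa.kernels"]:
       print(kernel[".name"], kernel[".kernarg_segment_size"])
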
3778 .. table:: AMDHSA Code Object V3 Metadata Map
3779 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3781 ================= ============== ========= =======================================
3782 String Key Value Type Required? Description
3783 ================= ============== ========= =======================================
3784 "amdhsa.version" sequence of Required - The first integer is the major
3785 2 integers version. Currently 1.
3786 - The second integer is the minor
3787 version. Currently 0.
3788 "amdhsa.printf" sequence of Each string is encoded information
3789 strings about a printf function call. The
3790 encoded information is organized as
3791 fields separated by colon (':'):
3793 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3798 A 32-bit integer as a unique id for
3799 each printf function call
3802 A 32-bit integer equal to the number
3803 of arguments of printf function call
3806 ``S[i]`` (where i = 0, 1, ... , N-1)
3807 32-bit integers for the size in bytes
3808 of the i-th FormatString argument of
3809 the printf function call
3812 The format string passed to the
3813 printf function call.
3814 "amdhsa.kernels" sequence of Required Sequence of the maps for each
3815 map kernel in the code object. See
3816 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3817 for the definition of the keys included
3819 ================= ============== ========= =======================================
3823 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3824 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3826 =================================== ============== ========= ================================
3827 String Key Value Type Required? Description
3828 =================================== ============== ========= ================================
3829 ".name" string Required Source name of the kernel.
3830 ".symbol" string Required Name of the kernel
3831 descriptor ELF symbol.
3832 ".language" string Source language of the kernel.
3842 ".language_version" sequence of - The first integer is the major
3844 - The second integer is the
3846 ".args" sequence of Sequence of maps of the
3847 map kernel arguments. See
3848 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3849 for the definition of the keys
3850 included in that map.
3851 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3852 3 integers must be >=1 and the dispatch
3853 work-group size X, Y, Z must
3854 correspond to the specified
3855 values. Defaults to 0, 0, 0.
3857 Corresponds to the OpenCL
3858 ``reqd_work_group_size``
3860 ".workgroup_size_hint" sequence of The dispatch work-group size
3861 3 integers X, Y, Z is likely to be the
3864 Corresponds to the OpenCL
3865 ``work_group_size_hint``
3867 ".vec_type_hint" string The name of a scalar or vector
3870 Corresponds to the OpenCL
3871 ``vec_type_hint`` attribute.
3873 ".device_enqueue_symbol" string The external symbol name
3874 associated with a kernel.
3875 OpenCL runtime allocates a
3876 global buffer for the symbol
3877 and saves the kernel's address
3878 to it, which is used for
3879 device side enqueueing. Only
3880 available for device side
3882 ".kernarg_segment_size" integer Required The size in bytes of
3884 that holds the values
3887 ".group_segment_fixed_size" integer Required The amount of group
3891 bytes. This does not
3893 dynamically allocated
3894 group segment memory
3898 ".private_segment_fixed_size" integer Required The amount of fixed
3899 private address space
3900 memory required for a
3902 bytes. If the kernel
3904 stack then additional
3906 to this value for the
3908 ".kernarg_segment_align" integer Required The maximum byte
3911 kernarg segment. Must
3913 ".wavefront_size" integer Required Wavefront size. Must
3915 ".sgpr_count" integer Required Number of scalar
3916 registers required by a
3918 GFX6-GFX9. A register
3919 is required if it is
3921 if a higher numbered
3924 includes the special
3930 SGPR added if a trap
3936 ".vgpr_count" integer Required Number of vector
3937 registers required by
3939 GFX6-GFX9. A register
3940 is required if it is
3942 if a higher numbered
3945 ".agpr_count" integer Required Number of accumulator
3946 registers required by
3949 ".max_flat_workgroup_size" integer Required Maximum flat
3952 kernel in work-items.
3955 ReqdWorkGroupSize if
3957 ".sgpr_spill_count" integer Number of stores from
3958 a scalar register to
3959 a register allocator
3962 ".vgpr_spill_count" integer Number of stores from
3963 a vector register to
3964 a register allocator
3967 ".kind" string The kind of the kernel
3975 These kernels must be
3976 invoked after loading
3986 These kernels must be
3989 containing code object
3990 and after all init and
3991 normal kernels in the
3992 same code object have
3996 If omitted, "normal" is
3998 ".max_num_work_groups_{x,y,z}" integer The max number of
3999 launched work-groups
4001 dimensions. Each number
4003 =================================== ============== ========= ================================
4007 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
4008 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
4010 ====================== ============== ========= ================================
4011 String Key Value Type Required? Description
4012 ====================== ============== ========= ================================
4013 ".name" string Kernel argument name.
4014 ".type_name" string Kernel argument type name.
4015 ".size" integer Required Kernel argument size in bytes.
4016 ".offset" integer Required Kernel argument offset in
4017 bytes. The offset must be a
4018 multiple of the alignment
4019 required by the argument.
4020 ".value_kind" string Required Kernel argument kind that
4021 specifies how to set up the
4022 corresponding argument.
4026 The argument is copied
4027 directly into the kernarg.
4030 A global address space pointer
4031 to the buffer data is passed
4034 "dynamic_shared_pointer"
4035 A group address space pointer
4036 to dynamically allocated LDS
4037 is passed in the kernarg.
4040 A global address space
4041 pointer to a S# is passed in
4045 A global address space
4046 pointer to a T# is passed in
4050 A global address space pointer
4051 to an OpenCL pipe is passed in
4055 A global address space pointer
4056 to an OpenCL device enqueue
4057 queue is passed in the
4060 "hidden_global_offset_x"
4061 The OpenCL grid dispatch
4062 global offset for the X
4063 dimension is passed in the
4066 "hidden_global_offset_y"
4067 The OpenCL grid dispatch
4068 global offset for the Y
4069 dimension is passed in the
4072 "hidden_global_offset_z"
4073 The OpenCL grid dispatch
4074 global offset for the Z
4075 dimension is passed in the
4079 An argument that is not used
4080 by the kernel. Space needs to
4081 be left for it, but it does
4082 not need to be set up.
4084 "hidden_printf_buffer"
4085 A global address space pointer
4086 to the runtime printf buffer
4087 is passed in kernarg. Mutually
4089 "hidden_hostcall_buffer"
4090 before Code Object V5.
4092 "hidden_hostcall_buffer"
4093 A global address space pointer
4094 to the runtime hostcall buffer
4095 is passed in kernarg. Mutually
4097 "hidden_printf_buffer"
4098 before Code Object V5.
4100 "hidden_default_queue"
4101 A global address space pointer
4102 to the OpenCL device enqueue
4103 queue that should be used by
4104 the kernel by default is
4105 passed in the kernarg.
4107 "hidden_completion_action"
4108 A global address space pointer
4109 to help link enqueued kernels into
4110 the ancestor tree for determining
4111 when the parent kernel has finished.
4113 "hidden_multigrid_sync_arg"
4114 A global address space pointer for
4115 multi-grid synchronization is
4116 passed in the kernarg.
4118 ".value_type" string Unused and deprecated. This should no longer
4119 be emitted, but is accepted for compatibility.
4121 ".pointee_align" integer Alignment in bytes of pointee
4122 type for pointer type kernel
4123 argument. Must be a power
4124 of 2. Only present if
4126 "dynamic_shared_pointer".
4127 ".address_space" string Kernel argument address space
4128 qualifier. Only present if
4129 ".value_kind" is "global_buffer" or
4130 "dynamic_shared_pointer". Values
4142 Is "global_buffer" only "global"
4144 "dynamic_shared_pointer" always
4145 "local"? Can HCC allow "generic"?
4146 How can "private" or "region"
4149 ".access" string Kernel argument access
4150 qualifier. Only present if
4151 ".value_kind" is "image" or
4164 ".actual_access" string The actual memory accesses
4165 performed by the kernel on the
4166 kernel argument. Only present if
4167 ".value_kind" is "global_buffer",
4168 "image", or "pipe". This may be
4169 more restrictive than indicated
4170 by ".access" to reflect what the
kernel actually does. If not
4172 present then the runtime must
4173 assume what is implied by
4174 ".access" and ".is_const" . Values
4181 ".is_const" boolean Indicates if the kernel argument
4182 is const qualified. Only present
4186 ".is_restrict" boolean Indicates if the kernel argument
4187 is restrict qualified. Only
4188 present if ".value_kind" is
4191 ".is_volatile" boolean Indicates if the kernel argument
4192 is volatile qualified. Only
4193 present if ".value_kind" is
4196 ".is_pipe" boolean Indicates if the kernel argument
4197 is pipe qualified. Only present
4198 if ".value_kind" is "pipe".
4202 Can "global_buffer" be pipe
4205 ====================== ============== ========= ================================
4207 .. _amdgpu-amdhsa-code-object-metadata-v4:
4209 Code Object V4 Metadata
4210 +++++++++++++++++++++++
Code object V4 is not the default code object version emitted by this version
of LLVM.

4216 Code object V4 metadata is the same as
4217 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
4218 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
4220 .. table:: AMDHSA Code Object V4 Metadata Map Changes
4221 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
4223 ================= ============== ========= =======================================
4224 String Key Value Type Required? Description
4225 ================= ============== ========= =======================================
4226 "amdhsa.version" sequence of Required - The first integer is the major
4227 2 integers version. Currently 1.
4228 - The second integer is the minor
4229 version. Currently 1.
4230 "amdhsa.target" string Required The target name of the code using the syntax:
4234 <target-triple> [ "-" <target-id> ]
4236 A canonical target ID must be
4237 used. See :ref:`amdgpu-target-triples`
4238 and :ref:`amdgpu-target-id`.
4239 ================= ============== ========= =======================================
4241 .. _amdgpu-amdhsa-code-object-metadata-v5:
4243 Code Object V5 Metadata
4244 +++++++++++++++++++++++
4246 Code object V5 metadata is the same as
4247 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
4248 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
4249 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
4250 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
4252 .. table:: AMDHSA Code Object V5 Metadata Map Changes
4253 :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
4255 ================= ============== ========= =======================================
4256 String Key Value Type Required? Description
4257 ================= ============== ========= =======================================
4258 "amdhsa.version" sequence of Required - The first integer is the major
4259 2 integers version. Currently 1.
4260 - The second integer is the minor
4261 version. Currently 2.
4262 ================= ============== ========= =======================================
4266 .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
4267 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
4269 ============================= ============= ========== =======================================
4270 String Key Value Type Required? Description
4271 ============================= ============= ========== =======================================
4272 ".uses_dynamic_stack" boolean Indicates if the generated machine code
4273 is using a dynamically sized stack.
4274 ".workgroup_processor_mode" boolean (GFX10+) Controls ENABLE_WGP_MODE in
4275 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4276 ============================= ============= ========== =======================================
4280 .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
4281 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
4283 =========================== ============== ========= ==============================
4284 String Key Value Type Required? Description
4285 =========================== ============== ========= ==============================
4286 ".uniform_work_group_size" integer Indicates if the kernel
4287 requires that each dimension
4288 of global size is a multiple
4289 of corresponding dimension of
4290 work-group size. Value of 1
4291 implies true and value of 0
4292 implies false. Metadata is
4293 only emitted when value is 1.
4294 =========================== ============== ========= ==============================
4300 .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
4301 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
4303 ====================== ============== ========= ================================
4304 String Key Value Type Required? Description
4305 ====================== ============== ========= ================================
4306 ".value_kind" string Required Kernel argument kind that
4307 specifies how to set up the
4308 corresponding argument.
4310 the same as code object V3 metadata
4311 (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
4312 with the following additions:
4314 "hidden_block_count_x"
4315 The grid dispatch work-group count for the X dimension
4316 is passed in the kernarg. Some languages, such as OpenCL,
4317 support a last work-group in each dimension being partial.
4318 This count only includes the non-partial work-group count.
4319 This is not the same as the value in the AQL dispatch packet,
4320 which has the grid size in work-items.
4322 "hidden_block_count_y"
4323 The grid dispatch work-group count for the Y dimension
4324 is passed in the kernarg. Some languages, such as OpenCL,
4325 support a last work-group in each dimension being partial.
4326 This count only includes the non-partial work-group count.
4327 This is not the same as the value in the AQL dispatch packet,
4328 which has the grid size in work-items. If the grid dimensionality
4329 is 1, then must be 1.
4331 "hidden_block_count_z"
4332 The grid dispatch work-group count for the Z dimension
4333 is passed in the kernarg. Some languages, such as OpenCL,
4334 support a last work-group in each dimension being partial.
4335 This count only includes the non-partial work-group count.
4336 This is not the same as the value in the AQL dispatch packet,
4337 which has the grid size in work-items. If the grid dimensionality
4338 is 1 or 2, then must be 1.
4340 "hidden_group_size_x"
4341 The grid dispatch work-group size for the X dimension is
4342 passed in the kernarg. This size only applies to the
4343 non-partial work-groups. This is the same value as the AQL
4344 dispatch packet work-group size.
4346 "hidden_group_size_y"
4347 The grid dispatch work-group size for the Y dimension is
4348 passed in the kernarg. This size only applies to the
4349 non-partial work-groups. This is the same value as the AQL
4350 dispatch packet work-group size. If the grid dimensionality
4351 is 1, then must be 1.
4353 "hidden_group_size_z"
4354 The grid dispatch work-group size for the Z dimension is
4355 passed in the kernarg. This size only applies to the
4356 non-partial work-groups. This is the same value as the AQL
4357 dispatch packet work-group size. If the grid dimensionality
4358 is 1 or 2, then must be 1.
4360 "hidden_remainder_x"
4361 The grid dispatch work group size of the partial work group
4362 of the X dimension, if it exists. Must be zero if a partial
4363 work group does not exist in the X dimension.
4365 "hidden_remainder_y"
4366 The grid dispatch work group size of the partial work group
4367 of the Y dimension, if it exists. Must be zero if a partial
4368 work group does not exist in the Y dimension.
4370 "hidden_remainder_z"
4371 The grid dispatch work group size of the partial work group
4372 of the Z dimension, if it exists. Must be zero if a partial
4373 work group does not exist in the Z dimension.
4376 The grid dispatch dimensionality. This is the same value
4377 as the AQL dispatch packet dimensionality. Must be a value
4381 A global address space pointer to an initialized memory
4382 buffer that conforms to the requirements of the malloc/free
4383 device library V1 version implementation.
4385 "hidden_dynamic_lds_size"
4386 Size of the dynamically allocated LDS memory is passed in the kernarg.
4388 "hidden_private_base"
4389 The high 32 bits of the flat addressing private aperture base.
4390 Only used by GFX8 to allow conversion between private segment
4391 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4393 "hidden_shared_base"
4394 The high 32 bits of the flat addressing shared aperture base.
4395 Only used by GFX8 to allow conversion between shared segment
4396 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4399 A global memory address space pointer to the ROCm runtime
4400 ``struct amd_queue_t`` structure for the HSA queue of the
4401 associated dispatch AQL packet. It is only required for pre-GFX9
4402 devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
4404 ====================== ============== ========= ================================
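The relationship between the hidden block count, group size, and remainder
arguments described in the table above can be illustrated with a small sketch
(one dimension shown; this is not ABI code).

.. code-block:: python

   def hidden_dispatch_values(grid_size, workgroup_size):
       """Derive the hidden_block_count_*, hidden_group_size_* and
       hidden_remainder_* values for one dimension from the dispatch grid
       size (in work-items) and work-group size."""
       block_count = grid_size // workgroup_size   # non-partial work-groups only
       remainder = grid_size % workgroup_size      # partial group size, 0 if none
       return block_count, workgroup_size, remainder

   # A 1000 work-item grid dispatched with 256 work-item work-groups:
   print(hidden_dispatch_values(1000, 256))        # (3, 256, 232)
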
4411 The HSA architected queuing language (AQL) defines a user space memory interface
4412 that can be used to control the dispatch of kernels, in an agent independent
4413 way. An agent can have zero or more AQL queues created for it using an HSA
4414 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
4415 are 64 bytes) can be placed. See the *HSA Platform System Architecture
4416 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
4418 The packet processor of a kernel agent is responsible for detecting and
4419 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
4420 packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

4424 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
4425 the kernel mode driver to initialize and register the AQL queue with CP.
To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU. A sketch of the
dispatch packet encoding is given after the list.
4430 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
4431 executed is obtained.
4432 2. A pointer to the kernel descriptor (see
4433 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
4434 It must be for a kernel that is contained in a code object that was loaded
by an HSA compatible runtime on the kernel agent with which the AQL queue is
associated.
4437 3. Space is allocated for the kernel arguments using the HSA compatible runtime
4438 allocator for a memory region with the kernarg property for the kernel agent
4439 that will execute the kernel. It must be at least 16-byte aligned.
4440 4. Kernel argument values are assigned to the kernel argument memory
4441 allocation. The layout is defined in the *HSA Programmer's Language
4442 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
4443 kernel argument memory in the same way constant memory is accessed. (Note
4444 that the HSA specification allows an implementation to copy the kernel
4445 argument contents to another location that is accessed by the kernel.)
4446 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
runtime API uses 64-bit atomic operations to reserve space in the AQL queue
4448 for the packet. The packet must be set up, and the final write must use an
4449 atomic store release to set the packet kind to ensure the packet contents are
4450 visible to the kernel agent. AQL defines a doorbell signal mechanism to
4451 notify the kernel agent that the AQL queue has been updated. These rules, and
the layout of the AQL queue and kernel dispatch packet are defined in the *HSA
4453 System Architecture Specification* [HSA]_.
4454 6. A kernel dispatch packet includes information about the actual dispatch,
4455 such as grid and work-group size, together with information from the code
4456 object about the kernel, such as segment sizes. The HSA compatible runtime
4457 queries on the kernel symbol can be used to obtain the code object values
4458 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
4459 7. CP executes micro-code and is responsible for detecting and setting up the
4460 GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
4462 code, the scalar general purpose registers (SGPR) and vector general purpose
4463 registers (VGPR) are set up as required by the machine code. The required
4464 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
4465 register state is defined in
4466 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4467 9. The prolog of the kernel machine code (see
4468 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
4469 before continuing executing the machine code that corresponds to the kernel.
4470 10. When the kernel dispatch has completed execution, CP signals the completion
4471 signal specified in the kernel dispatch packet if not 0.
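The following sketch shows how the 64-byte kernel dispatch packet from step 5
might be encoded, following the ``hsa_kernel_dispatch_packet_t`` layout in the
*HSA Platform System Architecture Specification* [HSA]_. All numeric values are
made up; a real dispatch would use the HSA compatible runtime API and an atomic
release store of the header.

.. code-block:: python

   import struct

   KERNEL_DISPATCH = 2                              # packet type
   header = (KERNEL_DISPATCH                        # bits 0-7: packet type
             | (1 << 8)                             # barrier bit
             | (2 << 9) | (2 << 11))                # system-scope acquire/release fences
   setup = 1                                        # grid dimensionality in bits 0-1

   packet = struct.pack(
       "<6H5I4Q",
       header, setup,
       256, 1, 1,                # work-group size x, y, z (work-items)
       0,                        # reserved
       1024, 1, 1,               # grid size x, y, z (work-items)
       0,                        # private_segment_size (bytes per work-item)
       0,                        # group_segment_size (bytes per work-group)
       0x2000,                   # kernel_object: address of the kernel descriptor
       0x3000,                   # kernarg_address (at least 16-byte aligned)
       0,                        # reserved
       0)                        # completion_signal handle (0 = no signal)
   assert len(packet) == 64
   # A real runtime reserves the queue slot with a 64-bit atomic, writes the
   # packet body, then stores the header last with release semantics and
   # rings the queue's doorbell signal.
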
4473 .. _amdgpu-amdhsa-memory-spaces:
4478 The memory space properties are:
4480 .. table:: AMDHSA Memory Spaces
4481 :name: amdgpu-amdhsa-memory-spaces-table

================= =========== ======== ======= ==================
Memory Space Name HSA Segment Hardware Address NULL Value
                  Name        Name     Size
================= =========== ======== ======= ==================
Private           private     scratch  32      0x00000000
Local             group       LDS      32      0xFFFFFFFF
Global            global      global   64      0x0000000000000000
Constant          constant    *same as 64      0x0000000000000000
                              global*
Generic           flat        flat     64      0x0000000000000000
Region            N/A         GDS      32      *not implemented
                                                for AMD HSA*
================= =========== ======== ======= ==================
4497 The global and constant memory spaces both use global virtual addresses, which
4498 are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

4502 Using the constant memory space indicates that the data will not change during
4503 the execution of the kernel. This allows scalar read instructions to be
4504 used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatch executions.

4508 The local memory space uses the hardware Local Data Store (LDS) which is
4509 automatically allocated when the hardware creates work-groups of wavefronts, and
4510 freed when all the wavefronts of a work-group have terminated. The data store
4511 (DS) instructions can be used to access it.
4513 The private memory space uses the hardware scratch memory support. If the kernel
4514 uses scratch, then the hardware allocates memory that is accessed using
4515 wavefront lane dword (4 byte) interleaving. The mapping used from private
4516 address to physical address is:
4518 ``wavefront-scratch-base +
4519 (private-address * wavefront-size * 4) +
4520 (wavefront-lane-id * 4)``
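As an illustration, the mapping above can be evaluated directly; treating
``private-address`` as a dword index is an assumption of this sketch.

.. code-block:: python

   def private_to_physical(private_address, lane_id, wavefront_scratch_base,
                           wavefront_size=64):
       """Evaluate the private-to-physical address mapping given above."""
       return (wavefront_scratch_base
               + private_address * wavefront_size * 4
               + lane_id * 4)

   # Lanes 0 and 1 of a wave-64 accessing the same private dword land in
   # adjacent 4-byte slots, so fewer cache lines need to be fetched.
   print(hex(private_to_physical(2, 0, 0x100000)),
         hex(private_to_physical(2, 1, 0x100000)))
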
4522 There are different ways that the wavefront scratch base address is determined
4523 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per-wavefront scratch offset, by the scratch
4526 instructions, or by flat instructions. If each lane of a wavefront accesses the
4527 same private address, the interleaving results in adjacent dwords being accessed
4528 and hence requires fewer cache lines to be fetched. Multi-dword access is not
4529 supported except by flat and scratch instructions in GFX9-GFX11.
4531 The generic address space uses the hardware flat address support available in
4532 GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures) that are outside the range of addressable global memory, to
4534 map from a flat address to a private or local address.
4536 FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
4538 aperture ranges. Flat access to scratch requires hardware aperture setup and
4539 setup in the kernel prologue (see
4540 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
4541 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
4542 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
To convert between a segment address and a flat address the base address of the
corresponding aperture can be used. For GFX7-GFX8 these are available in the
4546 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
4547 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
4548 GFX9-GFX11 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
4550 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
4551 which makes it easier to convert from flat to segment or segment to flat.
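A minimal sketch of the conversion, assuming a 2^32-byte aperture with a
2^32-aligned base as described above; the aperture base value used here is made
up.

.. code-block:: python

   APERTURE_SIZE = 1 << 32          # 2^32-byte apertures, 2^32-aligned bases

   def segment_to_flat(segment_address, aperture_base):
       """Form a flat address from a 32-bit private or local segment address
       and the matching aperture base (e.g. SRC_PRIVATE_BASE/SRC_SHARED_BASE)."""
       return aperture_base + segment_address

   def flat_to_segment(flat_address, aperture_base):
       """Recover the segment offset; only meaningful if the flat address
       lies inside the aperture."""
       assert aperture_base <= flat_address < aperture_base + APERTURE_SIZE
       return flat_address & 0xFFFFFFFF   # low 32 bits, as the base is 2^32 aligned

   shared_base = 0x0000_7F00_0000_0000    # made-up aperture base for illustration
   flat = segment_to_flat(0x1000, shared_base)
   print(hex(flat), hex(flat_to_segment(flat, shared_base)))
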
4556 Image and sample handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
4558 object respectively. In order to support the HSA ``query_sampler`` operations
4559 two extra dwords are used to store the HSA BRIG enumeration values for the
4560 queries that are not trivially deducible from the S# representation.
4565 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
4566 are 64-bit addresses of a structure allocated in memory accessible from both the
4567 CPU and GPU. The structure is defined by the runtime and subject to change
4568 between releases. For example, see [AMD-ROCm-github]_.
4570 .. _amdgpu-amdhsa-hsa-aql-queue:
4575 The HSA AQL queue structure is defined by an HSA compatible runtime (see
4576 :ref:`amdgpu-os`) and subject to change between releases. For example, see
4577 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
4578 certain language features such as the flat address aperture bases. It also
4579 contains fields used by CP such as managing the allocation of scratch memory.
4581 .. _amdgpu-amdhsa-kernel-descriptor:
4586 A kernel descriptor consists of the information needed by CP to initiate the
4587 execution of a kernel, including the entry point address of the machine code
4588 that implements the kernel.
4590 Code Object V3 Kernel Descriptor
4591 ++++++++++++++++++++++++++++++++
CP microcode requires the Kernel descriptor to be allocated on 64-byte
alignment.

4596 The fields used by CP for code objects before V3 also match those specified in
4597 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4599 .. table:: Code Object V3 Kernel Descriptor
4600 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
4602 ======= ======= =============================== ============================
4603 Bits Size Field Name Description
4604 ======= ======= =============================== ============================
4605 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
4606 address space memory
4607 required for a work-group
4608 in bytes. This does not
4609 include any dynamically
4610 allocated local address
4611 space memory that may be
4612 added when the kernel is
4614 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
4615 private address space
4616 memory required for a
4617 work-item in bytes. When
4618 this cannot be predicted,
4619 code object v4 and older
4620 sets this value to be
4621 higher than the minimum
4623 95:64 4 bytes KERNARG_SIZE The size of the kernarg
4624 memory pointed to by the
4625 AQL dispatch packet. The
4626 kernarg memory is used to
4627 pass arguments to the
4630 * If the kernarg pointer in
4631 the dispatch packet is NULL
4632 then there are no kernel
4634 * If the kernarg pointer in
4635 the dispatch packet is
4636 not NULL and this value
4637 is 0 then the kernarg
4640 * If the kernarg pointer in
4641 the dispatch packet is
4642 not NULL and this value
4643 is not 0 then the value
4644 specifies the kernarg
4645 memory size in bytes. It
4646 is recommended to provide
4647 a value as it may be used
4648 by CP to optimize making
4650 visible to the kernel
4653 127:96 4 bytes Reserved, must be 0.
4654 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
4657 descriptor to kernel's
4658 entry point instruction
4659 which must be 256 byte
4661 351:192 20 Reserved, must be 0.
4663 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
4664 Reserved, must be 0.
4667 program settings used by
4669 ``COMPUTE_PGM_RSRC3``
4672 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
4675 program settings used by
4677 ``COMPUTE_PGM_RSRC3``
4680 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
4683 program settings used by
4685 ``COMPUTE_PGM_RSRC3``
4688 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table`.
4689 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
4690 program settings used by
4692 ``COMPUTE_PGM_RSRC1``
4695 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
4696 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
4697 program settings used by
4699 ``COMPUTE_PGM_RSRC2``
4702 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
4703 458:448 7 bits *See separate bits below.* Enable the setup of the
4704 SGPR user data registers
4706 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4708 The total number of SGPR
4710 requested must not exceed
16 and match the value in
4712 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
4713 Any requests beyond 16
4715 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
4717 :ref:`amdgpu-processor-table`
4718 specifies *Architected flat
4719 scratch* then not supported
4721 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
4722 >450 1 bit ENABLE_SGPR_QUEUE_PTR
4723 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
4724 >452 1 bit ENABLE_SGPR_DISPATCH_ID
4725 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
4727 :ref:`amdgpu-processor-table`
4728 specifies *Architected flat
4729 scratch* then not supported
4731 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
4733 457:455 3 bits Reserved, must be 0.
4734 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
4735 Reserved, must be 0.
4738 wavefront size 64 mode.
4740 native wavefront size
4742 459 1 bit USES_DYNAMIC_STACK Indicates if the generated
4743 machine code is using a
4744 dynamically sized stack.
4745 This is only set in code
4746 object v5 and later.
4747 463:460 4 bits Reserved, must be 0.
4748 470:464 7 bits KERNARG_PRELOAD_SPEC_LENGTH GFX6-GFX9
4749 - Reserved, must be 0.
4751 - The number of dwords from
4752 the kernarg segment to preload
4753 into User SGPRs before kernel
4755 :ref:`amdgpu-amdhsa-kernarg-preload`).
4756 479:471 9 bits KERNARG_PRELOAD_SPEC_OFFSET GFX6-GFX9
4757 - Reserved, must be 0.
4759 - An offset in dwords into the
4760 kernarg segment to begin
4761 preloading data into User
4763 :ref:`amdgpu-amdhsa-kernarg-preload`).
4764 511:480 4 bytes Reserved, must be 0.
4765 512 **Total size 64 bytes.**
4766 ======= ====================================================================
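
For reference, the 64-byte layout above can be summarized as a plain struct.
The following is a minimal, illustrative sketch derived from the table; the
struct and field names here are not normative (the authoritative definition
lives in the LLVM sources, e.g.
``llvm/include/llvm/Support/AMDHSAKernelDescriptor.h``):

.. code-block:: c++

   #include <cstdint>

   // Illustrative layout of the code object V3+ kernel descriptor, following
   // the bit ranges in the table above. Names are descriptive, not normative.
   struct KernelDescriptorSketch {
     uint32_t group_segment_fixed_size;      // bits 31:0
     uint32_t private_segment_fixed_size;    // bits 63:32
     uint32_t kernarg_size;                  // bits 95:64
     uint8_t  reserved0[4];                  // bits 127:96
     int64_t  kernel_code_entry_byte_offset; // bits 191:128 (possibly negative)
     uint8_t  reserved1[20];                 // bits 351:192
     uint32_t compute_pgm_rsrc3;             // bits 383:352
     uint32_t compute_pgm_rsrc1;             // bits 415:384
     uint32_t compute_pgm_rsrc2;             // bits 447:416
     uint16_t kernel_code_properties;        // bits 463:448 (enable_sgpr_* etc.)
     uint16_t kernarg_preload;               // bits 479:464 (preload spec)
     uint8_t  reserved2[4];                  // bits 511:480
   };

   static_assert(sizeof(KernelDescriptorSketch) == 64,
                 "CP requires a 64-byte kernel descriptor");
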
4770 .. table:: compute_pgm_rsrc1 for GFX6-GFX12
4771 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table
4773 ======= ======= =============================== ===========================================================================
4774 Bits Size Field Name Description
4775 ======= ======= =============================== ===========================================================================
4776 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
4777 blocks used by each work-item;
4778 granularity is device
4783 - max(0, ceil(vgprs_used / 4) - 1)
4786 - vgprs_used = align(arch_vgprs, 4)
4788 - max(0, ceil(vgprs_used / 8) - 1)
4789 GFX10-GFX11 (wavefront size 64)
4791 - max(0, ceil(vgprs_used / 4) - 1)
4792 GFX10-GFX11 (wavefront size 32)
4794 - max(0, ceil(vgprs_used / 8) - 1)
4796 Where vgprs_used is defined
4797 as the highest VGPR number
4798 explicitly referenced plus
4801 Used by CP to set up
4802 ``COMPUTE_PGM_RSRC1.VGPRS``.
4805 :ref:`amdgpu-assembler`
4807 automatically for the
4808 selected processor from
4809 values provided to the
4810 `.amdhsa_kernel` directive
4812 `.amdhsa_next_free_vgpr`
4813 nested directive (see
4814 :ref:`amdhsa-kernel-directives-table`).
4815 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4816 blocks used by a wavefront;
4817 granularity is device
4822 - max(0, ceil(sgprs_used / 8) - 1)
4825 - 2 * max(0, ceil(sgprs_used / 16) - 1)
4827 Reserved, must be 0.
4832 defined as the highest
4833 SGPR number explicitly
4834 referenced plus one, plus
4835 a target specific number
4836 of additional special
4838 FLAT_SCRATCH (GFX7+) and
4839 XNACK_MASK (GFX8+), and
4842 limitations. It does not
4843 include the 16 SGPRs added
4844 if a trap handler is
4848 limitations and special
4849 SGPR layout are defined in
4851 documentation, which can
4853 :ref:`amdgpu-processors`
4856 Used by CP to set up
4857 ``COMPUTE_PGM_RSRC1.SGPRS``.
4860 :ref:`amdgpu-assembler`
4862 automatically for the
4863 selected processor from
4864 values provided to the
4865 `.amdhsa_kernel` directive
4867 `.amdhsa_next_free_sgpr`
4868 and `.amdhsa_reserve_*`
4869 nested directives (see
4870 :ref:`amdhsa-kernel-directives-table`).
4871 11:10 2 bits PRIORITY Must be 0.
4873 Start executing wavefront
4874 at the specified priority.
4876 CP is responsible for
4878 ``COMPUTE_PGM_RSRC1.PRIORITY``.
4879 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
4880 with specified rounding
4883 precision floating point
4886 Floating point rounding
4887 mode values are defined in
4888 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4890 Used by CP to set up
4891 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4892 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
4893 with specified rounding
4894 denorm mode for half/double (16
4895 and 64-bit) floating point
4896 precision floating point
4899 Floating point rounding
4900 mode values are defined in
4901 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4903 Used by CP to set up
4904 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4905 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
4906 with specified denorm mode
4909 precision floating point
4912 Floating point denorm mode
4913 values are defined in
4914 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4916 Used by CP to set up
4917 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4918 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
4919 with specified denorm mode
4921 and 64-bit) floating point
4922 precision floating point
4925 Floating point denorm mode
4926 values are defined in
4927 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4929 Used by CP to set up
4930 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4931 20 1 bit PRIV Must be 0.
4933 Start executing wavefront
4934 in privilege trap handler
4937 CP is responsible for
4939 ``COMPUTE_PGM_RSRC1.PRIV``.
4940 21 1 bit ENABLE_DX10_CLAMP GFX9-GFX11
4941 Wavefront starts execution
4942 with DX10 clamp mode
4943 enabled. Used by the vector
4944 ALU to force DX10 style
4945 treatment of NaN's (when
set, clamp NaN to zero,
otherwise pass NaN through).
4950 Used by CP to set up
4951 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4953 If 1, wavefronts are scheduled
4954 in a round-robin fashion with
4955 respect to the other wavefronts
4956 of the SIMD. Otherwise, wavefronts
4957 are scheduled in oldest age order.
4959 CP is responsible for filling in
4960 ``COMPUTE_PGM_RSRC1.WG_RR_EN``.
4961 22 1 bit DEBUG_MODE Must be 0.
4963 Start executing wavefront
4964 in single step mode.
4966 CP is responsible for
4968 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4969 23 1 bit ENABLE_IEEE_MODE GFX9-GFX11
4970 Wavefront starts execution
4972 enabled. Floating point
4973 opcodes that support
4974 exception flag gathering
4975 will quiet and propagate
4976 signaling-NaN inputs per
4977 IEEE 754-2008. Min_dx10 and
4978 max_dx10 become IEEE
4979 754-2008 compliant due to
4980 signaling-NaN propagation
4983 Used by CP to set up
4984 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4986 Reserved. Must be 0.
4987 24 1 bit BULKY Must be 0.
4989 Only one work-group allowed
to execute on a compute unit.
4993 CP is responsible for
4995 ``COMPUTE_PGM_RSRC1.BULKY``.
4996 25 1 bit CDBG_USER Must be 0.
4998 Flag that can be used to
4999 control debugging code.
5001 CP is responsible for
5003 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
5004 26 1 bit FP16_OVFL GFX6-GFX8
5005 Reserved, must be 0.
5007 Wavefront starts execution
5008 with specified fp16 overflow
5011 - If 0, fp16 overflow generates
5013 - If 1, fp16 overflow that is the
result of a +/-INF input value
5015 or divide by 0 produces a +/-INF,
5016 otherwise clamps computed
5017 overflow to +/-MAX_FP16 as
5020 Used by CP to set up
5021 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
5022 28:27 2 bits Reserved, must be 0.
5023 29 1 bit WGP_MODE GFX6-GFX9
5024 Reserved, must be 0.
5026 - If 0 execute work-groups in
5027 CU wavefront execution mode.
- If 1 execute work-groups
5029 in WGP wavefront execution mode.
5031 See :ref:`amdgpu-amdhsa-memory-model`.
5033 Used by CP to set up
5034 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
5035 30 1 bit MEM_ORDERED GFX6-GFX9
5036 Reserved, must be 0.
5038 Controls the behavior of the
5039 s_waitcnt's vmcnt and vscnt
5042 - If 0 vmcnt reports completion
5043 of load and atomic with return
5044 out of order with sample
5045 instructions, and the vscnt
5046 reports the completion of
5047 store and atomic without
5049 - If 1 vmcnt reports completion
5050 of load, atomic with return
5051 and sample instructions in
5052 order, and the vscnt reports
5053 the completion of store and
5054 atomic without return in order.
5056 Used by CP to set up
5057 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
5058 31 1 bit FWD_PROGRESS GFX6-GFX9
5059 Reserved, must be 0.
5061 - If 0 execute SIMD wavefronts
5062 using oldest first policy.
5063 - If 1 execute SIMD wavefronts to
5064 ensure wavefronts will make some
5067 Used by CP to set up
5068 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
5069 32 **Total size 4 bytes**
5070 ======= ===================================================================================================================
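
As a concrete illustration of the ``GRANULATED_*`` formulas above, the
following sketch computes the two register-count fields for a GFX9 wavefront
size 64 kernel. The helper names are illustrative only; other targets use the
variants listed in the table (for example, a different granularity or a
reserved field):

.. code-block:: c++

   #include <algorithm>
   #include <cstdint>

   static uint32_t ceilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

   // GFX6-GFX9: VGPRs are allocated in blocks of 4 registers.
   uint32_t granulatedWorkitemVgprCount(uint32_t vgprs_used) {
     return std::max(0, static_cast<int>(ceilDiv(vgprs_used, 4)) - 1);
   }

   // GFX9: SGPRs are allocated in blocks of 16 registers and the field is
   // encoded as twice the block count.
   uint32_t granulatedWavefrontSgprCountGfx9(uint32_t sgprs_used) {
     return 2 * std::max(0, static_cast<int>(ceilDiv(sgprs_used, 16)) - 1);
   }

When assembling with :ref:`amdgpu-assembler`, these fields are normally derived
automatically from the ``.amdhsa_next_free_vgpr`` and ``.amdhsa_next_free_sgpr``
directives, so computing them by hand is only needed when building descriptors
directly.
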
5074 .. table:: compute_pgm_rsrc2 for GFX6-GFX12
5075 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table
5077 ======= ======= =============================== ===========================================================================
5078 Bits Size Field Name Description
5079 ======= ======= =============================== ===========================================================================
5080 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
5082 * If the *Target Properties*
5084 :ref:`amdgpu-processor-table`
5087 scratch* then enable the
5089 wavefront scratch offset
5090 system register (see
5091 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5092 * If the *Target Properties*
5094 :ref:`amdgpu-processor-table`
5095 specifies *Architected
5096 flat scratch* then enable
5098 FLAT_SCRATCH register
5100 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5102 Used by CP to set up
5103 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
5104 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
5106 registers requested. This
5107 number must be greater than
5108 or equal to the number of user
5109 data registers enabled.
5111 Used by CP to set up
5112 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
5113 6 1 bit ENABLE_TRAP_HANDLER GFX6-GFX11
5117 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
5118 which is set by the CP if
5119 the runtime has installed a
5122 Reserved, must be 0.
5123 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
5124 system SGPR register for
5125 the work-group id in the X
5127 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5129 Used by CP to set up
5130 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
5131 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
5132 system SGPR register for
5133 the work-group id in the Y
5135 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5137 Used by CP to set up
5138 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
5139 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
5140 system SGPR register for
5141 the work-group id in the Z
5143 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5145 Used by CP to set up
5146 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
5147 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
5148 system SGPR register for
5149 work-group information (see
5150 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5152 Used by CP to set up
5153 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
5154 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
5155 VGPR system registers used
5156 for the work-item ID.
5157 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
5160 Used by CP to set up
5161 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
5162 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
5164 Wavefront starts execution
5166 exceptions enabled which
5167 are generated when L1 has
5168 witnessed a thread access
5172 CP is responsible for
5173 filling in the address
5175 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
5176 according to what the
5178 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
5180 Wavefront starts execution
5181 with memory violation
exceptions
5183 enabled which are generated
5184 when a memory violation has
5185 occurred for this wavefront from
5187 (write-to-read-only-memory,
5188 mis-aligned atomic, LDS
5189 address out of range,
5190 illegal address, etc.).
5194 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
5195 according to what the
5197 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
5199 CP uses the rounded value
5200 from the dispatch packet,
5201 not this value, as the
5202 dispatch may contain
5203 dynamically allocated group
5204 segment memory. CP writes
5206 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
5208 Amount of group segment
5209 (LDS) to allocate for each
5210 work-group. Granularity is
5214 roundup(lds-size / (64 * 4))
5216 roundup(lds-size / (128 * 4))
5218 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
5219 _INVALID_OPERATION with specified exceptions
5222 Used by CP to set up
5223 ``COMPUTE_PGM_RSRC2.EXCP_EN``
5224 (set from bits 0..6).
5228 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
5229 _SOURCE input operands is a
5231 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
5232 _DIVISION_BY_ZERO Zero
27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
5235 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
5237 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
5239 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
5240 _ZERO (rcp_iflag_f32 instruction
5242 31 1 bit RESERVED Reserved, must be 0.
5243 32 **Total size 4 bytes.**
5244 ======= ===================================================================================================================
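
The ``GRANULATED_LDS_SIZE`` row above rounds the work-group LDS allocation up
to a target-dependent granule. A minimal sketch of that roundup, with the
granule passed in because it differs between targets (the table lists
``64 * 4`` bytes for some targets and ``128 * 4`` bytes for others):

.. code-block:: c++

   #include <cstdint>

   // roundup(lds_bytes / granule_bytes), as used for
   // COMPUTE_PGM_RSRC2.LDS_SIZE. CP computes this from the dispatch packet;
   // the kernel descriptor field itself must be 0.
   uint32_t granulatedLdsSize(uint32_t lds_bytes, uint32_t granule_bytes) {
     return (lds_bytes + granule_bytes - 1) / granule_bytes;
   }
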
5248 .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
5249 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
5251 ======= ======= =============================== ===========================================================================
5252 Bits Size Field Name Description
5253 ======= ======= =============================== ===========================================================================
5254 5:0 6 bits ACCUM_OFFSET Offset of a first AccVGPR in the unified register file. Granularity 4.
5255 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
5256 63 - accum-offset = 256.
15:6 10 bits Reserved, must be 0.
5259 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
5260 launched in the same CU.
5261 - If 1 the waves of a work-group can be
5262 launched in different CUs. The waves
5263 cannot use S_BARRIER or LDS.
31:17 15 bits Reserved, must be 0.
5266 32 **Total size 4 bytes.**
5267 ======= ===================================================================================================================
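
The ``ACCUM_OFFSET`` field uses a granularity-4, biased encoding (value 0 means
offset 4, value 63 means offset 256). A small sketch of the encode/decode
implied by the table, with illustrative function names:

.. code-block:: c++

   #include <cstdint>

   // accum_offset_vgprs must be a multiple of 4 in the range 4..256.
   uint32_t encodeAccumOffset(uint32_t accum_offset_vgprs) {
     return accum_offset_vgprs / 4 - 1;
   }

   // field_value is in the range 0..63.
   uint32_t decodeAccumOffset(uint32_t field_value) {
     return (field_value + 1) * 4;
   }
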
5271 .. table:: compute_pgm_rsrc3 for GFX10-GFX11
5272 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
5274 ======= ======= =============================== ===========================================================================
5275 Bits Size Field Name Description
5276 ======= ======= =============================== ===========================================================================
5277 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For
5278 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
5279 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
5280 not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
5281 9:4 6 bits INST_PREF_SIZE GFX10
5282 Reserved, must be 0.
5284 Number of instruction bytes to prefetch, starting at the kernel's entry
5285 point instruction, before wavefront starts execution. The value is 0..63
5286 with a granularity of 128 bytes.
5287 10 1 bit TRAP_ON_START GFX10
5288 Reserved, must be 0.
5292 If 1, wavefront starts execution by trapping into the trap handler.
5294 CP is responsible for filling in the trap on start bit in
5295 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
5297 11 1 bit TRAP_ON_END GFX10
5298 Reserved, must be 0.
5302 If 1, wavefront execution terminates by trapping into the trap handler.
5304 CP is responsible for filling in the trap on end bit in
5305 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
5306 30:12 19 bits Reserved, must be 0.
5307 31 1 bit IMAGE_OP GFX10
5308 Reserved, must be 0.
5310 If 1, the kernel execution contains image instructions. If executed as
5311 part of a graphics pipeline, image read instructions will stall waiting
5312 for any necessary ``WAIT_SYNC`` fence to be performed in order to
5313 indicate that earlier pipeline stages have completed writing to the
5316 Not used for compute kernels that are not part of a graphics pipeline and
5318 32 **Total size 4 bytes.**
5319 ======= ===================================================================================================================
5323 .. table:: compute_pgm_rsrc3 for GFX12
5324 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table
5326 ======= ======= =============================== ===========================================================================
5327 Bits Size Field Name Description
5328 ======= ======= =============================== ===========================================================================
5329 3:0 4 bits RESERVED Reserved, must be 0.
5330 11:4 8 bits INST_PREF_SIZE Number of instruction bytes to prefetch, starting at the kernel's entry
5331 point instruction, before wavefront starts execution. The value is 0..255
5332 with a granularity of 128 bytes.
5333 12 1 bit RESERVED Reserved, must be 0.
5334 13 1 bit GLG_EN If 1, group launch guarantee will be enabled for this dispatch
5335 30:14 17 bits RESERVED Reserved, must be 0.
5336 31 1 bit IMAGE_OP If 1, the kernel execution contains image instructions. If executed as
5337 part of a graphics pipeline, image read instructions will stall waiting
5338 for any necessary ``WAIT_SYNC`` fence to be performed in order to
5339 indicate that earlier pipeline stages have completed writing to the
5342 Not used for compute kernels that are not part of a graphics pipeline and
5344 32 **Total size 4 bytes.**
5345 ======= ===================================================================================================================
5349 .. table:: Floating Point Rounding Mode Enumeration Values
5350 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
5352 ====================================== ===== ==============================
5353 Enumeration Name Value Description
5354 ====================================== ===== ==============================
5355 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
5356 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
5357 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
5358 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
5359 ====================================== ===== ==============================
5362 .. table:: Extended FLT_ROUNDS Enumeration Values
5363 :name: amdgpu-rounding-mode-enumeration-values-table
5365 +------------------------+---------------+-------------------+--------------------+----------+
5366 | | F32 NEAR_EVEN | F32 PLUS_INFINITY | F32 MINUS_INFINITY | F32 ZERO |
5367 +------------------------+---------------+-------------------+--------------------+----------+
5368 | F64/F16 NEAR_EVEN | 1 | 11 | 14 | 17 |
5369 +------------------------+---------------+-------------------+--------------------+----------+
5370 | F64/F16 PLUS_INFINITY | 8 | 2 | 15 | 18 |
5371 +------------------------+---------------+-------------------+--------------------+----------+
5372 | F64/F16 MINUS_INFINITY | 9 | 12 | 3 | 19 |
5373 +------------------------+---------------+-------------------+--------------------+----------+
5374 | F64/F16 ZERO | 10 | 13 | 16 | 0 |
5375 +------------------------+---------------+-------------------+--------------------+----------+
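
For example, the extended ``FLT_ROUNDS`` value can be read off the table above
as a function of the two hardware rounding-mode fields. The following sketch
transcribes the table, indexing with the ``FLOAT_ROUND_MODE_*`` enumeration
values (0 = NEAR_EVEN, 1 = PLUS_INFINITY, 2 = MINUS_INFINITY, 3 = ZERO); the
function name is illustrative:

.. code-block:: c++

   #include <cstdint>

   int extendedFltRounds(uint32_t f32_mode, uint32_t f16_f64_mode) {
     // Rows: F64/F16 mode; columns: F32 mode (NEAR_EVEN, +INF, -INF, ZERO).
     static const int kTable[4][4] = {
         {1, 11, 14, 17},  // F64/F16 NEAR_EVEN
         {8, 2, 15, 18},   // F64/F16 PLUS_INFINITY
         {9, 12, 3, 19},   // F64/F16 MINUS_INFINITY
         {10, 13, 16, 0},  // F64/F16 ZERO
     };
     return kTable[f16_f64_mode][f32_mode];
   }
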
5379 .. table:: Floating Point Denorm Mode Enumeration Values
5380 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
5382 ====================================== ===== ====================================
5383 Enumeration Name Value Description
5384 ====================================== ===== ====================================
5385 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination Denorms
5386 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
5387 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
5388 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
5389 ====================================== ===== ====================================
Denormal flushing is sign respecting, i.e. the behavior expected by
``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
``"denormal-fp-math"="positive-zero"``.

5397 .. table:: System VGPR Work-Item ID Enumeration Values
5398 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
5400 ======================================== ===== ============================
5401 Enumeration Name Value Description
5402 ======================================== ===== ============================
5403 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
5405 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
5407 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
5409 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
5410 ======================================== ===== ============================
5412 .. _amdgpu-amdhsa-initial-kernel-execution-state:
5414 Initial Kernel Execution State
5415 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5417 This section defines the register state that will be set up by the packet
5418 processor prior to the start of execution of every wavefront. This is limited by
5419 the constraints of the hardware controllers of CP/ADC/SPI.
5421 The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
5423 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5424 for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.
5428 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
5429 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
5430 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
5431 actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.
5435 SGPR register initial state is defined in
5436 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
5438 .. table:: SGPR Register Set Up Order
5439 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
5441 ========== ========================== ====== ==============================
5442 SGPR Order Name Number Description
5443 (kernel descriptor enable of
5445 ========== ========================== ====== ==============================
5446 First Private Segment Buffer 4 See
5447 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5449 then Dispatch Ptr 2 64-bit address of AQL dispatch
5450 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
5452 then Queue Ptr 2 64-bit address of amd_queue_t
5453 (enable_sgpr_queue_ptr) object for AQL queue on which
5454 the dispatch packet was
5456 then Kernarg Segment Ptr 2 64-bit address of Kernarg
5457 (enable_sgpr_kernarg segment. This is directly
5458 _segment_ptr) copied from the
5459 kernarg_address in the kernel
5462 Having CP load it once avoids
loading it at the beginning of every wavefront.
5465 then Dispatch Id 2 64-bit Dispatch ID of the
5466 (enable_sgpr_dispatch_id) dispatch packet being
5468 then Flat Scratch Init 2 See
5469 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5471 then Preloaded Kernargs N/A See
5472 (kernarg_preload_spec :ref:`amdgpu-amdhsa-kernarg-preload`.
5474 then Private Segment Size 1 The 32-bit byte size of a
5475 (enable_sgpr_private single work-item's memory
5476 _segment_size) allocation. This is the
5477 value from the kernel
5478 dispatch packet Private
5479 Segment Byte Size rounded up
by CP to a multiple of DWORD.
5483 Having CP load it once avoids
loading it at the beginning of every wavefront.
5487 This is not used for
5488 GFX7-GFX8 since it is the same
5489 value as the second SGPR of
5490 Flat Scratch Init. However, it
5491 may be needed for GFX9-GFX11 which
5492 changes the meaning of the
5493 Flat Scratch Init value.
5494 then Work-Group Id X 1 32-bit work-group id in X
5495 (enable_sgpr_workgroup_id dimension of grid for
5497 then Work-Group Id Y 1 32-bit work-group id in Y
5498 (enable_sgpr_workgroup_id dimension of grid for
5500 then Work-Group Id Z 1 32-bit work-group id in Z
5501 (enable_sgpr_workgroup_id dimension of grid for
5503 then Work-Group Info 1 {first_wavefront, 14'b0000,
5504 (enable_sgpr_workgroup ordered_append_term[10:0],
5505 _info) threadgroup_size_in_wavefronts[5:0]}
5506 then Scratch Wavefront Offset 1 See
5507 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5508 _segment_wavefront_offset) and
5509 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5510 ========== ========================== ====== ==============================
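
Because enabled SGPRs are numbered densely, the starting SGPR of each user data
register can be derived directly from the enable bits and the widths in the
table above. A minimal sketch (covering only the CP-initialized user data
registers listed above, and ignoring preloaded kernargs and the system SGPRs
that follow; names are illustrative):

.. code-block:: c++

   #include <cstdio>

   struct EnabledUserSgprs {
     bool private_segment_buffer, dispatch_ptr, queue_ptr, kernarg_segment_ptr,
          dispatch_id, flat_scratch_init, private_segment_size;
   };

   void printUserSgprLayout(const EnabledUserSgprs &e) {
     struct { const char *name; bool enabled; int width; } fields[] = {
         {"private_segment_buffer", e.private_segment_buffer, 4},
         {"dispatch_ptr",           e.dispatch_ptr,           2},
         {"queue_ptr",              e.queue_ptr,              2},
         {"kernarg_segment_ptr",    e.kernarg_segment_ptr,    2},
         {"dispatch_id",            e.dispatch_id,            2},
         {"flat_scratch_init",      e.flat_scratch_init,      2},
         {"private_segment_size",   e.private_segment_size,   1},
     };
     int next = 0; // the first enabled register is SGPR0
     for (const auto &f : fields) {
       if (!f.enabled)
         continue; // disabled registers do not consume an SGPR number
       std::printf("%s -> s[%d:%d]\n", f.name, next, next + f.width - 1);
       next += f.width;
     }
   }
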
5512 The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
5514 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5515 for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.
5519 There are different methods used for the VGPR initial state:
5521 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
5522 specifies otherwise, a separate VGPR register is used per work-item ID. The
5523 VGPR register initial state for this method is defined in
5524 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
5525 * If *Target Properties* column of :ref:`amdgpu-processor-table`
5526 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
5527 for all work-item IDs. The register layout for this method is defined in
5528 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
5530 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
5531 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
5533 ========== ========================== ====== ==============================
5534 VGPR Order Name Number Description
5535 (kernel descriptor enable of
5537 ========== ========================== ====== ==============================
5538 First Work-Item Id X 1 32-bit work-item id in X
5539 (Always initialized) dimension of work-group for
5541 then Work-Item Id Y 1 32-bit work-item id in Y
5542 (enable_vgpr_workitem_id dimension of work-group for
5543 > 0) wavefront lane.
5544 then Work-Item Id Z 1 32-bit work-item id in Z
5545 (enable_vgpr_workitem_id dimension of work-group for
5546 > 1) wavefront lane.
5547 ========== ========================== ====== ==============================
5551 .. table:: Register Layout for Packed Work-Item ID Method
5552 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
5554 ======= ======= ================ =========================================
5555 Bits Size Field Name Description
5556 ======= ======= ================ =========================================
5557 0:9 10 bits Work-Item Id X Work-item id in X
5558 dimension of work-group for
5563 10:19 10 bits Work-Item Id Y Work-item id in Y
5564 dimension of work-group for
5567 Initialized if enable_vgpr_workitem_id >
5568 0, otherwise set to 0.
5569 20:29 10 bits Work-Item Id Z Work-item id in Z
5570 dimension of work-group for
5573 Initialized if enable_vgpr_workitem_id >
5574 1, otherwise set to 0.
5575 30:31 2 bits Reserved, set to 0.
5576 ======= ======= ================ =========================================
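
For targets with *Packed work-item IDs*, the three ids can be unpacked from
VGPR0 with simple shifts and masks following the bit layout above. A minimal
sketch (names are illustrative):

.. code-block:: c++

   #include <cstdint>

   struct WorkItemId { uint32_t x, y, z; };

   // v0 holds the packed ids: X in bits 9:0, Y in bits 19:10, Z in bits 29:20.
   // Y and Z read as 0 unless enable_vgpr_workitem_id requested them.
   WorkItemId unpackWorkItemId(uint32_t v0) {
     return { v0         & 0x3ff,
              (v0 >> 10) & 0x3ff,
              (v0 >> 20) & 0x3ff };
   }
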
5578 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
5582 2. Work-group Id registers X, Y, Z are set by ADC which supports any
5583 combination including none.
3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
5585 its value cannot be included with the flat scratch init value which is per
5586 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
5589 5. Flat Scratch register pair initialization is described in
5590 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5592 The global segment can be accessed either using buffer instructions (GFX6 which
5593 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
5594 instructions (GFX9-GFX11).
5596 If buffer operations are used, then the compiler can generate a V# with the
5597 following properties:
5601 * ATC: 1 if IOMMU present (such as APU)
5603 * MTYPE set to support memory coherence that matches the runtime (such as CC for
5604 APU and NC for dGPU).
5606 .. _amdgpu-amdhsa-kernarg-preload:
5608 Preloaded Kernel Arguments
5609 ++++++++++++++++++++++++++
5611 On hardware that supports this feature, kernel arguments can be preloaded into
5612 User SGPRs, up to the maximum number of User SGPRs available. The allocation of
5613 Preload SGPRs occurs directly after the last enabled non-kernarg preload User
5614 SGPR. (See :ref:`amdgpu-amdhsa-initial-kernel-execution-state`)
The preloaded data is copied from the kernarg segment; the amount of data is
5617 determined by the value specified in the kernarg_preload_spec_length field of
5618 the kernel descriptor. This data is then loaded into consecutive User SGPRs. The
5619 number of SGPRs receiving preloaded kernarg data corresponds with the value
5620 given by kernarg_preload_spec_length. The preloading starts at the dword offset
5621 within the kernarg segment, which is specified by the
5622 kernarg_preload_spec_offset field.
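
A small sketch of the resulting mapping, assuming the hypothetical
``first_preload_sgpr`` is the SGPR immediately after the last enabled
non-kernarg-preload user data register:

.. code-block:: c++

   #include <cstdint>
   #include <cstdio>

   // Print which kernarg dword lands in which user SGPR for a given
   // kernarg_preload_spec_offset/length pair (both in dwords).
   void printKernargPreloadMap(uint32_t spec_offset, uint32_t spec_length,
                               uint32_t first_preload_sgpr) {
     for (uint32_t i = 0; i < spec_length; ++i)
       std::printf("kernarg dword %u -> s%u\n", spec_offset + i,
                   first_preload_sgpr + i);
   }
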
5624 If the kernarg_preload_spec_length is non-zero, the CP firmware will append an
5625 additional 256 bytes to the kernel_code_entry_byte_offset. This addition
5626 facilitates the incorporation of a prologue to the kernel entry to handle cases
5627 where code designed for kernarg preloading is executed on hardware equipped with
5628 incompatible firmware. If hardware has compatible firmware the 256 bytes at the
5629 start of the kernel entry will be skipped. Additionally, the compiler backend
5630 may insert a trap instruction at the start of the kernel prologue to manage
situations where kernarg preloading is attempted on hardware with incompatible
firmware.
5634 .. _amdgpu-amdhsa-kernel-prolog:
5639 The compiler performs initialization in the kernel prologue depending on the
target and on information such as stack usage in the kernel and called
5641 functions. Some of this initialization requires the compiler to request certain
5642 User and System SGPRs be present in the
5643 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
5644 :ref:`amdgpu-amdhsa-kernel-descriptor`.
5646 .. _amdgpu-amdhsa-kernel-prolog-cfi:
5651 1. The CFI return address is undefined.
5653 2. The CFI CFA is defined using an expression which evaluates to a location
5654 description that comprises one memory location description for the
5655 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
5657 .. _amdgpu-amdhsa-kernel-prolog-m0:
For GFX6-GFX8, the M0 register must be initialized with a value that is at
least the total LDS size if the kernel may access LDS via DS or flat
operations. The total LDS size is available in the dispatch packet. For M0, it
is also possible to use the maximum possible value of LDS for the given target
(0x7FFF for GFX6 and 0xFFFF for GFX7-GFX8).
For GFX9-GFX11, the M0 register is not used for range checking LDS accesses and so does not
5670 need to be initialized in the prolog.
5672 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
5677 If the kernel has function calls it must set up the ABI stack pointer described
5678 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.
5682 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
5687 If the kernel needs a frame pointer for the reasons defined in
5688 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
5689 kernel prolog. If a frame pointer is not required then all uses of the frame
5690 pointer are replaced with immediate ``0`` offsets.
5692 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
5697 There are different methods used for initializing flat scratch:
5699 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5700 specifies *Does not support generic address space*:
5702 Flat scratch is not supported and there is no flat scratch register pair.
5704 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5705 specifies *Offset flat scratch*:
5707 If the kernel or any function it calls may use flat operations to access
5708 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5709 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
5710 Scratch Wavefront Offset SGPR registers (see
5711 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5713 1. The low word of Flat Scratch Init is the 32-bit byte offset from
5714 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
5715 being managed by SPI for the queue executing the kernel dispatch. This is
5716 the same value used in the Scratch Segment Buffer V# base address.
5718 CP obtains this from the runtime. (The Scratch Segment Buffer base address
5719 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
5721 The prolog must add the value of Scratch Wavefront Offset to get the
5722 wavefront's byte scratch backing memory offset from
5723 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
5725 The Scratch Wavefront Offset must also be used as an offset with Private
5726 segment address when using the Scratch Segment Buffer.
Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
5729 shifted by 8 before moving into FLAT_SCRATCH_HI.
5731 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
5732 SGPRn is the highest numbered SGPR allocated to the wavefront).
5733 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
5734 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
5735 FLAT SCRATCH BASE in flat memory instructions that access the scratch
2. The second word of Flat Scratch Init is the 32-bit byte size of a single
   work-item's scratch memory usage.
5740 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
5741 checks that the value in the kernel dispatch packet Private Segment Byte
Size is not larger and requests the runtime to increase the queue's scratch
size if necessary.
5745 CP directly loads from the kernel dispatch packet Private Segment Byte Size
5746 field and rounds up to a multiple of DWORD. Having CP load it once avoids
5747 loading it at the beginning of every wavefront.
5749 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
5750 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
in flat memory instructions (see the sketch after this list).
5753 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5754 specifies *Absolute flat scratch*:
5756 If the kernel or any function it calls may use flat operations to access
5757 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5758 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
5759 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
5760 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5762 The Flat Scratch Init is the 64-bit address of the base of scratch backing
5763 memory being managed by SPI for the queue executing the kernel dispatch.
5765 CP obtains this from the runtime.
5767 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
5768 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
5769 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
5770 memory instructions.
5772 The Scratch Wavefront Offset must also be used as an offset with Private
5773 segment address when using the Scratch Segment Buffer (see
5774 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
5776 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5777 specifies *Architected flat scratch*:
5779 If ENABLE_PRIVATE_SEGMENT is enabled in
5780 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table` then the FLAT_SCRATCH
5781 register pair will be initialized to the 64-bit address of the base of scratch
5782 backing memory being managed by SPI for the queue executing the kernel
5783 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
5784 flat scratch base in flat memory instructions.
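
As referenced above, the following sketch shows the register values the prolog
must produce for the *Offset flat scratch* (GFX7-GFX8) case, given the two
dwords of Flat Scratch Init and the wave's Scratch Wavefront Offset. The names
are illustrative and this is not generated code:

.. code-block:: c++

   #include <cstdint>

   struct FlatScratchRegs { uint32_t flat_scratch_lo, flat_scratch_hi; };

   FlatScratchRegs offsetFlatScratchSetup(uint32_t flat_scratch_init_lo,
                                          uint32_t flat_scratch_init_hi,
                                          uint32_t scratch_wavefront_offset) {
     FlatScratchRegs r;
     // Step 1: the wave's byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID,
     // shifted right by 8 because the register is in units of 256 bytes.
     r.flat_scratch_hi = (flat_scratch_init_lo + scratch_wavefront_offset) >> 8;
     // Step 2: the per work-item scratch size, used as the FLAT SCRATCH SIZE.
     r.flat_scratch_lo = flat_scratch_init_hi;
     return r;
   }
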
5786 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
5788 Private Segment Buffer
5789 ++++++++++++++++++++++
5791 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
5792 *Architected flat scratch* then a Private Segment Buffer is not supported.
5793 Instead the flat SCRATCH instructions are used.
5795 Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
5796 that are used as a V# to access scratch. CP uses the value provided by the
5797 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
5798 access the private memory space using a segment address. See
5799 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:

5804 - If it is known during instruction selection that there is stack usage,
5805 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
5806 optimizations are disabled (``-O0``), if stack objects already exist (for
5807 locals, etc.), or if there are any function calls.
5809 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5810 are reserved for the tentative scratch V#. These will be used if it is
5811 determined that spilling is needed.
5813 - If no use is made of the tentative scratch V#, then it is unreserved,
5814 and the register count is determined ignoring it.
5815 - If use is made of the tentative scratch V#, then its register numbers
5816 are shifted to the first four-aligned SGPR index after the highest one
5817 allocated by the register allocator, and all uses are updated. The
5818 register count includes them in the shifted location.
5819 - In either case, if the processor has the SGPR allocation bug, the
5820 tentative allocation is not shifted or unreserved in order to ensure
the register count is higher to work around the bug.
5825 This approach of using a tentative scratch V# and shifting the register
5826 numbers if used avoids having to perform register allocation a second
5827 time if the tentative V# is eliminated. This is more efficient and
5828 avoids the problem that the second register allocation may perform
5829 spilling which will fail as there is no longer a scratch V#.
5831 When the kernel prolog code is being emitted it is known whether the scratch V#
5832 described above is actually used. If it is, the prolog code must set it up by
5833 copying the Private Segment Buffer to the scratch V# registers and then adding
5834 the Private Segment Wavefront Offset to the queue base address in the V#. The
5835 result is a V# with a base address pointing to the beginning of the wavefront
5836 scratch backing memory.
5838 The Private Segment Buffer is always requested, but the Private Segment
5839 Wavefront Offset is only requested if it is used (see
5840 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5842 .. _amdgpu-amdhsa-memory-model:
5847 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5848 code (see :ref:`memmodel`).
5850 The AMDGPU backend supports the memory synchronization scopes specified in
5851 :ref:`amdgpu-memory-scopes`.
5853 The code sequences used to implement the memory model specify the order of
5854 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5855 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5856 to other memory instructions executed by the same thread. This allows them to be
5857 moved earlier or later which can allow them to be combined with other instances
5858 of the same instruction, or hoisted/sunk out of loops to improve performance.
5859 Only the instructions related to the memory model are given; additional
5860 ``s_waitcnt`` instructions are required to ensure registers are defined before
being used. These can sometimes be combined with the memory model ``s_waitcnt``
5862 instructions as described above.
5864 The AMDGPU backend supports the following memory models:
5866 HSA Memory Model [HSA]_
5867 The HSA memory model uses a single happens-before relation for all address
5868 spaces (see :ref:`amdgpu-address-spaces`).
5869 OpenCL Memory Model [OpenCL]_
5870 The OpenCL memory model which has separate happens-before relations for the
5871 global and local address spaces. Only a fence specifying both global and
5872 local address space, and seq_cst instructions join the relationships. Since
the LLVM ``fence`` instruction does not allow an address space to be
specified, the OpenCL fence has to conservatively assume both the local and
global address spaces were specified. However, optimizations can often be
5876 done to eliminate the additional ``s_waitcnt`` instructions when there are
5877 no intervening memory instructions which access the corresponding address
5878 space. The code sequences in the table indicate what can be omitted for the
OpenCL memory model. The target triple environment is used to determine if the
5880 source language is OpenCL (see :ref:`amdgpu-opencl`).
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.
5885 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5886 termed vector memory operations.
5888 Private address space uses ``buffer_load/store`` using the scratch V#
5889 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5890 is accessing the memory, atomic memory orderings are not meaningful, and all
5891 accesses are treated as non-atomic.
5893 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5894 scalar memory instructions). Since the constant address space contents do not
5895 change during the execution of a kernel dispatch it is not legal to perform
5896 stores, and atomic memory orderings are not meaningful, and all accesses are
5897 treated as non-atomic.
5899 A memory synchronization scope wider than work-group is not meaningful for the
5900 group (LDS) address space and is treated as work-group.
The memory model does not support the region address space which is treated as
non-atomic.
5905 Acquire memory ordering is not meaningful on store atomic instructions and is
5906 treated as non-atomic.
5908 Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
5911 Acquire-release memory ordering is not meaningful on load or store atomic
5912 instructions and is treated as acquire and release respectively.
5914 The memory order also adds the single thread optimization constraints defined in
5916 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5918 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5919 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5921 ============ ==============================================================
5922 LLVM Memory Optimization Constraints
5924 ============ ==============================================================
5927 acquire - If a load atomic/atomicrmw then no following load/load
5928 atomic/store/store atomic/atomicrmw/fence instruction can be
5929 moved before the acquire.
5930 - If a fence then same as load atomic, plus no preceding
5931 associated fence-paired-atomic can be moved after the fence.
5932 release - If a store atomic/atomicrmw then no preceding load/load
5933 atomic/store/store atomic/atomicrmw/fence instruction can be
5934 moved after the release.
5935 - If a fence then same as store atomic, plus no following
5936 associated fence-paired-atomic can be moved before the
5938 acq_rel Same constraints as both acquire and release.
5939 seq_cst - If a load atomic then same constraints as acquire, plus no
5940 preceding sequentially consistent load atomic/store
5941 atomic/atomicrmw/fence instruction can be moved after the
5943 - If a store atomic then the same constraints as release, plus
5944 no following sequentially consistent load atomic/store
5945 atomic/atomicrmw/fence instruction can be moved before the
5947 - If an atomicrmw/fence then same constraints as acq_rel.
5948 ============ ==============================================================
The code sequences used to implement the memory model are defined in the
following sections:

5953 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5954 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5955 * :ref:`amdgpu-amdhsa-memory-model-gfx942`
5956 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5958 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5960 Memory Model GFX6-GFX9
5961 ++++++++++++++++++++++
5965 * Each agent has multiple shader arrays (SA).
5966 * Each SA has multiple compute units (CU).
5967 * Each CU has multiple SIMDs that execute wavefronts.
5968 * The wavefronts for a single work-group are executed in the same CU but may be
5969 executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
5972 * All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
5975 * The LDS memory has multiple request queues shared by the SIMDs of a
5976 CU. Therefore, the LDS operations performed by different wavefronts of a
5977 work-group can be reordered relative to each other, which can result in
5978 reordering the visibility of vector memory operations with respect to LDS
5979 operations of other wavefronts in the same work-group. A ``s_waitcnt
5980 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5981 vector memory operations between wavefronts of a work-group, but not between
5982 operations performed by the same wavefront.
5983 * The vector memory operations are performed as wavefront wide operations and
5984 completion is reported to a wavefront in execution order. The exception is
5985 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5986 vector memory order if they access LDS memory, and out of LDS operation order
5987 if they access global memory.
5988 * The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between the
5990 lanes of a single wavefront, or for coherence between wavefronts in the same
5991 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
5994 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5995 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way and so do not impact the memory
5997 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  an agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
6002 * Each CU has a separate request queue per channel. Therefore, the vector and
6003 scalar memory operations performed by wavefronts executing in different
6004 work-groups (which may be executing on different CUs) of an agent can be
6005 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
6006 ensure synchronization between vector memory operations of different CUs. It
6007 ensures a previous vector memory operation has completed before executing a
6008 subsequent vector memory or LDS operation and so can be used to meet the
6009 requirements of acquire and release.
6010 * The L2 cache can be kept coherent with other agents on some targets, or ranges
6011 of virtual addresses can be set up to bypass it to ensure system coherence.
6013 Scalar memory operations are only used to access memory that is proven to not
6014 change during the execution of the kernel dispatch. This includes constant
6015 address space and global address space for program scope ``const`` variables.
6016 Therefore, the kernel machine code does not have to maintain the scalar cache to
6017 ensure it is coherent with the vector caches. The scalar and vector caches are
6018 invalidated between kernel dispatches by CP since constant address space data
6019 may change between kernel dispatch executions. See
6020 :ref:`amdgpu-amdhsa-memory-spaces`.
6022 The one exception is if scalar writes are used to spill SGPR registers. In this
6023 case the AMDGPU backend ensures the memory location used to spill is never
6024 accessed by vector memory operations at the same time. If scalar writes are used
6025 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6026 return since the locations may be used for vector memory instructions by a
6027 future wavefront that uses the same scratch area, or a function call that
6028 creates a frame at the same address, respectively. There is no need for a
6029 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6031 For kernarg backing memory:
6033 * CP invalidates the L1 cache at the start of each kernel dispatch.
6034 * On dGPU the kernarg backing memory is allocated in host memory accessed as
6035 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
6039 and so the L2 cache will be coherent with the CPU and other agents.
6041 Scratch backing memory (which is used for the private address space) is accessed
6042 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6043 only accessed by a single thread, and is always write-before-read, there is
6044 never a need to invalidate these entries from the L1 cache. Hence all cache
6045 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6047 The code sequences used to implement the memory model for GFX6-GFX9 are defined
6048 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
6050 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
6051 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
6053 ============ ============ ============== ========== ================================
6054 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
6055 Ordering Sync Scope Address GFX6-GFX9
6057 ============ ============ ============== ========== ================================
6059 ------------------------------------------------------------------------------------
6060 load *none* *none* - global - !volatile & !nontemporal
6062 - private 1. buffer/global/flat_load
6064 - !volatile & nontemporal
6066 1. buffer/global/flat_load
6071 1. buffer/global/flat_load
6073 2. s_waitcnt vmcnt(0)
6075 - Must happen before
6076 any following volatile
6087 load *none* *none* - local 1. ds_load
6088 store *none* *none* - global - !volatile & !nontemporal
6090 - private 1. buffer/global/flat_store
6092 - !volatile & nontemporal
6094 1. buffer/global/flat_store
6099 1. buffer/global/flat_store
6100 2. s_waitcnt vmcnt(0)
6102 - Must happen before
6103 any following volatile
6114 store *none* *none* - local 1. ds_store
6115 **Unordered Atomic**
6116 ------------------------------------------------------------------------------------
6117 load atomic unordered *any* *any* *Same as non-atomic*.
6118 store atomic unordered *any* *any* *Same as non-atomic*.
6119 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
6120 **Monotonic Atomic**
6121 ------------------------------------------------------------------------------------
6122 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
6124 - workgroup - generic
6125 load atomic monotonic - agent - global 1. buffer/global/flat_load
6126 - system - generic glc=1
6127 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
6128 - wavefront - generic
6132 store atomic monotonic - singlethread - local 1. ds_store
6135 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
6136 - wavefront - generic
6140 atomicrmw monotonic - singlethread - local 1. ds_atomic
6144 ------------------------------------------------------------------------------------
6145 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
6148 load atomic acquire - workgroup - global 1. buffer/global_load
6149 load atomic acquire - workgroup - local 1. ds/flat_load
6150 - generic 2. s_waitcnt lgkmcnt(0)
6153 - Must happen before
6162 older than a local load
6166 load atomic acquire - agent - global 1. buffer/global_load
6168 2. s_waitcnt vmcnt(0)
6170 - Must happen before
6178 3. buffer_wbinvl1_vol
6180 - Must happen before
6190 load atomic acquire - agent - generic 1. flat_load glc=1
6191 - system 2. s_waitcnt vmcnt(0) &
6196 - Must happen before
6199 - Ensures the flat_load
6204 3. buffer_wbinvl1_vol
6206 - Must happen before
6216 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
6219 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
6220 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
6221 - generic 2. s_waitcnt lgkmcnt(0)
6224 - Must happen before
6237 atomicrmw acquire - agent - global 1. buffer/global_atomic
6238 - system 2. s_waitcnt vmcnt(0)
6240 - Must happen before
6249 3. buffer_wbinvl1_vol
6251 - Must happen before
6261 atomicrmw acquire - agent - generic 1. flat_atomic
6262 - system 2. s_waitcnt vmcnt(0) &
6267 - Must happen before
6276 3. buffer_wbinvl1_vol
6278 - Must happen before
6288 fence acquire - singlethread *none* *none*
6290 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6295 - However, since LLVM
6320 fence-paired-atomic).
6321 - Must happen before
6332 fence-paired-atomic.
6334 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
6341 - However, since LLVM
6349 - Could be split into
6358 - s_waitcnt vmcnt(0)
6369 fence-paired-atomic).
6370 - s_waitcnt lgkmcnt(0)
6381 fence-paired-atomic).
6382 - Must happen before
6396 fence-paired-atomic.
6398 2. buffer_wbinvl1_vol
6400 - Must happen before any
6401 following global/generic
6411 ------------------------------------------------------------------------------------
6412 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
6415 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6424 - Must happen before
6435 2. buffer/global/flat_store
6436 store atomic release - workgroup - local 1. ds_store
6437 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
6438 - system - generic vmcnt(0)
6444 - Could be split into
6453 - s_waitcnt vmcnt(0)
6460 - s_waitcnt lgkmcnt(0)
6467 - Must happen before
6478 2. buffer/global/flat_store
6479 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
6482 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6491 - Must happen before
6502 2. buffer/global/flat_atomic
6503 atomicrmw release - workgroup - local 1. ds_atomic
6504 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
6505 - system - generic vmcnt(0)
6509 - Could be split into
6518 - s_waitcnt vmcnt(0)
6525 - s_waitcnt lgkmcnt(0)
6532 - Must happen before
6543 2. buffer/global/flat_atomic
6544 fence release - singlethread *none* *none*
6546 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6551 - However, since LLVM
6572 - Must happen before
6581 fence-paired-atomic).
6588 fence-paired-atomic.
6590 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
6601 - However, since LLVM
6616 - Could be split into
6625 - s_waitcnt vmcnt(0)
6632 - s_waitcnt lgkmcnt(0)
6639 - Must happen before
6648 fence-paired-atomic).
6655 fence-paired-atomic.
6657 **Acquire-Release Atomic**
6658 ------------------------------------------------------------------------------------
6659 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
6662 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
6671 - Must happen before
6682 2. buffer/global_atomic
6684 atomicrmw acq_rel - workgroup - local 1. ds_atomic
6685 2. s_waitcnt lgkmcnt(0)
6688 - Must happen before
6697 older than the local load
6701 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
6710 - Must happen before
6722 3. s_waitcnt lgkmcnt(0)
6725 - Must happen before
6734 older than a local load
6738 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
6743 - Could be split into
6752 - s_waitcnt vmcnt(0)
6759 - s_waitcnt lgkmcnt(0)
6766 - Must happen before
6777 2. buffer/global_atomic
6778 3. s_waitcnt vmcnt(0)
6780 - Must happen before
6789 4. buffer_wbinvl1_vol
6791 - Must happen before
6801 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
6806 - Could be split into
6815 - s_waitcnt vmcnt(0)
6822 - s_waitcnt lgkmcnt(0)
6829 - Must happen before
6841 3. s_waitcnt vmcnt(0) &
6846 - Must happen before
6855 4. buffer_wbinvl1_vol
6857 - Must happen before
6867 fence acq_rel - singlethread *none* *none*
6869 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6889 - Must happen before
6912 acquire-fence-paired-atomic)
6933 release-fence-paired-atomic).
6938 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
6945 - However, since LLVM
6953 - Could be split into
6962 - s_waitcnt vmcnt(0)
6969 - s_waitcnt lgkmcnt(0)
6976 - Must happen before
6981 global/local/generic
6990 acquire-fence-paired-atomic)
7002 global/local/generic
7011 release-fence-paired-atomic).
7016 2. buffer_wbinvl1_vol
7018 - Must happen before
7032 **Sequential Consistent Atomic**
7033 ------------------------------------------------------------------------------------
7034 load atomic seq_cst - singlethread - global *Same as corresponding
7035 - wavefront - local load atomic acquire,
7036 - generic except must generate
7037 all instructions even
7039 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
7055 lgkmcnt(0) and so do
7087 order. The s_waitcnt
7088 could be placed after
7092 make the s_waitcnt be
7099 instructions same as
7102 except must generate
7103 all instructions even
7105 load atomic seq_cst - workgroup - local *Same as corresponding
7106 load atomic acquire,
7107 except must generate
7108 all instructions even
7111 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
7112 - system - generic vmcnt(0)
7114 - Could be split into
7123 - s_waitcnt lgkmcnt(0)
7136 lgkmcnt(0) and so do
7139 - s_waitcnt vmcnt(0)
7184 order. The s_waitcnt
7185 could be placed after
7189 make the s_waitcnt be
7196 instructions same as
7199 except must generate
7200 all instructions even
7202 store atomic seq_cst - singlethread - global *Same as corresponding
7203 - wavefront - local store atomic release,
7204 - workgroup - generic except must generate
7205 - agent all instructions even
7206 - system for OpenCL.*
7207 atomicrmw seq_cst - singlethread - global *Same as corresponding
7208 - wavefront - local atomicrmw acq_rel,
7209 - workgroup - generic except must generate
7210 - agent all instructions even
7211 - system for OpenCL.*
7212 fence seq_cst - singlethread *none* *Same as corresponding
7213 - wavefront fence acq_rel,
7214 - workgroup except must generate
7215 - agent all instructions even
7216 - system for OpenCL.*
7217 ============ ============ ============== ========== ================================
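
As a worked illustration of reading the table above, the sketch below pairs the
agent-scope acquire-load and release-store rows for global/generic addresses with
plausible machine code. Register choices are hypothetical, and whether
``lgkmcnt(0)`` must also be waited on depends on the address space, scope, and
source language as the table's notes describe; this is a sketch, not a definitive
lowering.

.. code-block:: none

   // Sketch: load atomic acquire, agent scope, generic address.
   flat_load_dword v0, v[2:3] glc      // 1. flat_load glc=1
   s_waitcnt vmcnt(0) lgkmcnt(0)       // 2. wait for the load to complete
   buffer_wbinvl1_vol                  // 3. invalidate volatile vector L1 lines

   // Sketch: store atomic release, agent scope, global address.
   s_waitcnt vmcnt(0) lgkmcnt(0)       // 1. drain prior global/LDS accesses
   global_store_dword v[2:3], v1, off  // 2. the releasing store
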
7219 .. _amdgpu-amdhsa-memory-model-gfx90a:
7226 * Each agent has multiple shader arrays (SA).
7227 * Each SA has multiple compute units (CU).
7228 * Each CU has multiple SIMDs that execute wavefronts.
7229 * The wavefronts for a single work-group are executed in the same CU but may be
7230 executed by different SIMDs. The exception is tgsplit execution mode, in which
7231 the wavefronts may be executed by different SIMDs in different CUs.
7232 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
7233 executing on it. The exception is tgsplit execution mode, in which no LDS is
7234 allocated since wavefronts of the same work-group can be in different CUs.
7235 * All LDS operations of a CU are performed as wavefront wide operations in a
7236 global order and involve no caching. Completion is reported to a wavefront in execution order.
7238 * The LDS memory has multiple request queues shared by the SIMDs of a
7239 CU. Therefore, the LDS operations performed by different wavefronts of a
7240 work-group can be reordered relative to each other, which can result in
7241 reordering the visibility of vector memory operations with respect to LDS
7242 operations of other wavefronts in the same work-group. A ``s_waitcnt
7243 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
7244 vector memory operations between wavefronts of a work-group, but not between
7245 operations performed by the same wavefront (a sketch of this synchronization follows this list).
7246 * The vector memory operations are performed as wavefront wide operations and
7247 completion is reported to a wavefront in execution order. The exception is
7248 that ``flat_load/store/atomic`` instructions can report out of vector memory
7249 order if they access LDS memory, and out of LDS operation order if they access global memory.
7251 * The vector memory operations access a single vector L1 cache shared by all
7252 SIMDs of a CU. Therefore:
7254 * No special action is required for coherence between the lanes of a single wavefront.
7257 * No special action is required for coherence between wavefronts in the same
7258 work-group since they execute on the same CU. The exception is when in
7259 tgsplit execution mode as wavefronts of the same work-group can be in
7260 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
7263 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
7264 executing in different work-groups as they may be executing on different CUs.
7267 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
7268 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
7269 scalar operations are used in a restricted way so do not impact the memory
7270 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
7271 * The vector and scalar memory operations use an L2 cache shared by all CUs on the same agent.
7274 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
7276 * Each CU has a separate request queue per channel. Therefore, the vector and
7277 scalar memory operations performed by wavefronts executing in different
7278 work-groups (which may be executing on different CUs), or the same
7279 work-group if executing in tgsplit mode, of an agent can be reordered
7280 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
7281 synchronization between vector memory operations of different CUs. It
7282 ensures a previous vector memory operation has completed before executing a
7283 subsequent vector memory or LDS operation and so can be used to meet the
7284 requirements of acquire and release.
7285 * The L2 cache of one agent can be kept coherent with other agents by:
7286 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
7287 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
7288 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
7290 * Any local memory cache lines will be automatically invalidated by writes
7291 from CUs associated with other L2 caches, or writes from the CPU, due to
7292 the cache probe caused by coherent requests. Coherent requests are caused
7293 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
7294 XGMI, and by PCIe requests that are configured to be coherent requests.
7295 * XGMI accesses from the CPU to local memory may be cached on the CPU.
7296 Subsequent access from the GPU will automatically invalidate or writeback
7297 the CPU cache due to the L2 probe filter and the PTE C-bit being set.
7298 * Since all work-groups on the same agent share the same L2, no L2
7299 invalidation or writeback is required for coherence.
7300 * To ensure coherence of local and remote memory writes of work-groups in
7301 different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
7302 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
7303 (used for remote coarse grain memory). Note that MTYPE CC (used for local
7304 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
7305 remote fine grain memory) bypasses the L2, so both will never result in
7306 dirty L2 cache lines.
7307 * To ensure coherence of local and remote memory reads of work-groups in
7308 different agents a ``buffer_invl2`` is required. It will invalidate L2
7309 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
7310 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
7311 coarse grain memory) cause local reads to be invalidated by remote writes
7312 with the PTE C-bit, so these cache lines are not invalidated. Note that
7313 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
7314 never result in L2 cache lines that need to be invalidated.
7316 * PCIe access from the GPU to the CPU memory is kept coherent by using the
7317 MTYPE UC (uncached) which bypasses the L2.
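
The inter-wavefront LDS/vector-memory ordering point called out in the list above
can be pictured with a minimal sketch: one wavefront publishes a payload in LDS and
then raises a flag in global memory, with ``s_waitcnt lgkmcnt(0)`` keeping the two
stores ordered for the other wavefronts of the work-group. The registers and the
flag location are hypothetical.

.. code-block:: none

   // Sketch: a wavefront publishes data through LDS, then signals via a global flag.
   ds_write_b32 v0, v1                 // write the payload to LDS
   s_waitcnt lgkmcnt(0)                // ensure the LDS write has completed ...
   global_store_dword v[2:3], v4, off  // ... before the flag store is performed

Without the wait, the independent LDS request queues would allow another wavefront
of the work-group to observe the flag before the payload.
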
7319 Scalar memory operations are only used to access memory that is proven to not
7320 change during the execution of the kernel dispatch. This includes constant
7321 address space and global address space for program scope ``const`` variables.
7322 Therefore, the kernel machine code does not have to maintain the scalar cache to
7323 ensure it is coherent with the vector caches. The scalar and vector caches are
7324 invalidated between kernel dispatches by CP since constant address space data
7325 may change between kernel dispatch executions. See
7326 :ref:`amdgpu-amdhsa-memory-spaces`.
7328 The one exception is if scalar writes are used to spill SGPR registers. In this
7329 case the AMDGPU backend ensures the memory location used to spill is never
7330 accessed by vector memory operations at the same time. If scalar writes are used
7331 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
7332 return since the locations may be used for vector memory instructions by a
7333 future wavefront that uses the same scratch area, or a function call that
7334 creates a frame at the same address, respectively. There is no need for a
7335 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
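
As a minimal sketch of the spill case just described (the spill stores themselves
are elided and the placement is illustrative), the writeback is emitted immediately
before the kernel ends:

.. code-block:: none

   // ... earlier scalar stores spill SGPRs to the scratch backing memory ...
   s_dcache_wb    // write back dirty scalar cache lines so a later wavefront
                  // reusing the same scratch sees the spilled values
   s_endpgm
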
7337 For kernarg backing memory:
7339 * CP invalidates the L1 cache at the start of each kernel dispatch.
7340 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
7341 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
7342 cache. This also causes it to be treated as non-volatile and so is not
7343 invalidated by ``*_vol``.
7344 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
7345 so the L2 cache will be coherent with the CPU and other agents.
7347 Scratch backing memory (which is used for the private address space) is accessed
7348 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
7349 only accessed by a single thread, and is always write-before-read, there is
7350 never a need to invalidate these entries from the L1 cache. Hence all cache
7351 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
7353 The code sequences used to implement the memory model for GFX90A are defined
7354 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
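
For example, the table's system-scope rows for a global ``store atomic release``
and ``load atomic acquire`` correspond to sequences like the following sketch.
Register choices are hypothetical, and the TgSplit and OpenCL variations described
in the table's notes are omitted.

.. code-block:: none

   // Sketch: store atomic release, system scope, global address (GFX90A).
   buffer_wbl2                            // 1. write back dirty L2 lines
   s_waitcnt vmcnt(0) lgkmcnt(0)          // 2. wait for prior accesses and the writeback
   global_store_dword v[2:3], v1, off     // 3. the releasing store

   // Sketch: load atomic acquire, system scope, global address (GFX90A).
   global_load_dword v0, v[2:3], off glc  // 1. the acquiring load (glc=1)
   s_waitcnt vmcnt(0)                     // 2. wait for the load to complete
   buffer_invl2                           // 3. invalidate stale MTYPE NC lines in L2
   buffer_wbinvl1_vol                     // 4. invalidate volatile vector L1 lines
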
7356 .. table:: AMDHSA Memory Model Code Sequences GFX90A
7357 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
7359 ============ ============ ============== ========== ================================
7360 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
7361 Ordering Sync Scope Address GFX90A
7363 ============ ============ ============== ========== ================================
7365 ------------------------------------------------------------------------------------
7366 load *none* *none* - global - !volatile & !nontemporal
7368 - private 1. buffer/global/flat_load
7370 - !volatile & nontemporal
7372 1. buffer/global/flat_load
7377 1. buffer/global/flat_load
7379 2. s_waitcnt vmcnt(0)
7381 - Must happen before
7382 any following volatile
7393 load *none* *none* - local 1. ds_load
7394 store *none* *none* - global - !volatile & !nontemporal
7396 - private 1. buffer/global/flat_store
7398 - !volatile & nontemporal
7400 1. buffer/global/flat_store
7405 1. buffer/global/flat_store
7406 2. s_waitcnt vmcnt(0)
7408 - Must happen before
7409 any following volatile
7420 store *none* *none* - local 1. ds_store
7421 **Unordered Atomic**
7422 ------------------------------------------------------------------------------------
7423 load atomic unordered *any* *any* *Same as non-atomic*.
7424 store atomic unordered *any* *any* *Same as non-atomic*.
7425 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
7426 **Monotonic Atomic**
7427 ------------------------------------------------------------------------------------
7428 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
7429 - wavefront - generic
7430 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
7433 - If not TgSplit execution
7436 load atomic monotonic - singlethread - local *If TgSplit execution mode,
7437 - wavefront local address space cannot
7438 - workgroup be used.*
7441 load atomic monotonic - agent - global 1. buffer/global/flat_load
7443 load atomic monotonic - system - global 1. buffer/global/flat_load
7445 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
7446 - wavefront - generic
7449 store atomic monotonic - system - global 1. buffer/global/flat_store
7451 store atomic monotonic - singlethread - local *If TgSplit execution mode,
7452 - wavefront local address space cannot
7453 - workgroup be used.*
7456 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
7457 - wavefront - generic
7460 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
7462 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
7463 - wavefront local address space cannot
7464 - workgroup be used.*
7468 ------------------------------------------------------------------------------------
7469 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
7472 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
7474 - If not TgSplit execution
7477 2. s_waitcnt vmcnt(0)
7479 - If not TgSplit execution
7481 - Must happen before the
7482 following buffer_wbinvl1_vol.
7484 3. buffer_wbinvl1_vol
7486 - If not TgSplit execution
7488 - Must happen before
7499 load atomic acquire - workgroup - local *If TgSplit execution mode,
7500 local address space cannot
7504 2. s_waitcnt lgkmcnt(0)
7507 - Must happen before
7516 older than the local load
7520 load atomic acquire - workgroup - generic 1. flat_load glc=1
7522 - If not TgSplit execution
7525 2. s_waitcnt lgkm/vmcnt(0)
7527 - Use lgkmcnt(0) if not
7528 TgSplit execution mode
7529 and vmcnt(0) if TgSplit
7531 - If OpenCL, omit lgkmcnt(0).
7532 - Must happen before
7534 buffer_wbinvl1_vol and any
7535 following global/generic
7542 older than a local load
7546 3. buffer_wbinvl1_vol
7548 - If not TgSplit execution
7555 load atomic acquire - agent - global 1. buffer/global_load
7557 2. s_waitcnt vmcnt(0)
7559 - Must happen before
7567 3. buffer_wbinvl1_vol
7569 - Must happen before
7579 load atomic acquire - system - global 1. buffer/global/flat_load
7581 2. s_waitcnt vmcnt(0)
7583 - Must happen before
7584 following buffer_invl2 and
7594 - Must happen before
7602 stale L1 global data,
7603 nor see stale L2 MTYPE
7605 MTYPE RW and CC memory will
7606 never be stale in L2 due to
7609 load atomic acquire - agent - generic 1. flat_load glc=1
7610 2. s_waitcnt vmcnt(0) &
7613 - If TgSplit execution mode,
7617 - Must happen before
7620 - Ensures the flat_load
7625 3. buffer_wbinvl1_vol
7627 - Must happen before
7637 load atomic acquire - system - generic 1. flat_load glc=1
7638 2. s_waitcnt vmcnt(0) &
7641 - If TgSplit execution mode,
7645 - Must happen before
7649 - Ensures the flat_load
7657 - Must happen before
7665 stale L1 global data,
7666 nor see stale L2 MTYPE
7668 MTYPE RW and CC memory will
7669 never be stale in L2 due to
7672 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
7673 - wavefront - generic
7674 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
7675 - wavefront local address space cannot
7679 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
7680 2. s_waitcnt vmcnt(0)
7682 - If not TgSplit execution
7684 - Must happen before the
7685 following buffer_wbinvl1_vol.
7686 - Ensures the atomicrmw
7691 3. buffer_wbinvl1_vol
7693 - If not TgSplit execution
7695 - Must happen before
7705 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
7706 local address space cannot
7710 2. s_waitcnt lgkmcnt(0)
7713 - Must happen before
7722 older than the local
7726 atomicrmw acquire - workgroup - generic 1. flat_atomic
7727 2. s_waitcnt lgkm/vmcnt(0)
7729 - Use lgkmcnt(0) if not
7730 TgSplit execution mode
7731 and vmcnt(0) if TgSplit
7733 - If OpenCL, omit lgkmcnt(0).
7734 - Must happen before
7736 buffer_wbinvl1_vol and
7749 3. buffer_wbinvl1_vol
7751 - If not TgSplit execution
7758 atomicrmw acquire - agent - global 1. buffer/global_atomic
7759 2. s_waitcnt vmcnt(0)
7761 - Must happen before
7770 3. buffer_wbinvl1_vol
7772 - Must happen before
7782 atomicrmw acquire - system - global 1. buffer/global_atomic
7783 2. s_waitcnt vmcnt(0)
7785 - Must happen before
7786 following buffer_invl2 and
7797 - Must happen before
7805 stale L1 global data,
7806 nor see stale L2 MTYPE
7808 MTYPE RW and CC memory will
7809 never be stale in L2 due to
7812 atomicrmw acquire - agent - generic 1. flat_atomic
7813 2. s_waitcnt vmcnt(0) &
7816 - If TgSplit execution mode,
7820 - Must happen before
7829 3. buffer_wbinvl1_vol
7831 - Must happen before
7841 atomicrmw acquire - system - generic 1. flat_atomic
7842 2. s_waitcnt vmcnt(0) &
7845 - If TgSplit execution mode,
7849 - Must happen before
7862 - Must happen before
7870 stale L1 global data,
7871 nor see stale L2 MTYPE
7873 MTYPE RW and CC memory will
7874 never be stale in L2 due to
7877 fence acquire - singlethread *none* *none*
7879 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7881 - Use lgkmcnt(0) if not
7882 TgSplit execution mode
7883 and vmcnt(0) if TgSplit
7893 - However, since LLVM
7908 - s_waitcnt vmcnt(0)
7920 fence-paired-atomic).
7921 - s_waitcnt lgkmcnt(0)
7932 fence-paired-atomic).
7933 - Must happen before
7935 buffer_wbinvl1_vol and
7946 fence-paired-atomic.
7948 2. buffer_wbinvl1_vol
7950 - If not TgSplit execution
7957 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
7960 - If TgSplit execution mode,
7966 - However, since LLVM
7974 - Could be split into
7983 - s_waitcnt vmcnt(0)
7994 fence-paired-atomic).
7995 - s_waitcnt lgkmcnt(0)
8006 fence-paired-atomic).
8007 - Must happen before
8021 fence-paired-atomic.
8023 2. buffer_wbinvl1_vol
8025 - Must happen before any
8026 following global/generic
8035 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
8038 - If TgSplit execution mode,
8044 - However, since LLVM
8052 - Could be split into
8061 - s_waitcnt vmcnt(0)
8072 fence-paired-atomic).
8073 - s_waitcnt lgkmcnt(0)
8084 fence-paired-atomic).
8085 - Must happen before
8086 the following buffer_invl2 and
8099 fence-paired-atomic.
8104 - Must happen before any
8105 following global/generic
8112 stale L1 global data,
8113 nor see stale L2 MTYPE
8115 MTYPE RW and CC memory will
8116 never be stale in L2 due to
8119 ------------------------------------------------------------------------------------
8120 store atomic release - singlethread - global 1. buffer/global/flat_store
8121 - wavefront - generic
8122 store atomic release - singlethread - local *If TgSplit execution mode,
8123 - wavefront local address space cannot
8127 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8129 - Use lgkmcnt(0) if not
8130 TgSplit execution mode
8131 and vmcnt(0) if TgSplit
8133 - If OpenCL, omit lgkmcnt(0).
8134 - s_waitcnt vmcnt(0)
8137 global/generic load/store/
8138 load atomic/store atomic/
8140 - s_waitcnt lgkmcnt(0)
8147 - Must happen before
8158 2. buffer/global/flat_store
8159 store atomic release - workgroup - local *If TgSplit execution mode,
8160 local address space cannot
8164 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
8167 - If TgSplit execution mode,
8173 - Could be split into
8182 - s_waitcnt vmcnt(0)
8189 - s_waitcnt lgkmcnt(0)
8196 - Must happen before
8207 2. buffer/global/flat_store
8208 store atomic release - system - global 1. buffer_wbl2
8210 - Must happen before
8211 following s_waitcnt.
8212 - Performs L2 writeback to
8216 visible at system scope.
8218 2. s_waitcnt lgkmcnt(0) &
8221 - If TgSplit execution mode,
8227 - Could be split into
8236 - s_waitcnt vmcnt(0)
8237 must happen after any
8243 - s_waitcnt lgkmcnt(0)
8244 must happen after any
8250 - Must happen before
8255 to memory and the L2
8262 3. buffer/global/flat_store
8263 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
8264 - wavefront - generic
8265 atomicrmw release - singlethread - local *If TgSplit execution mode,
8266 - wavefront local address space cannot
8270 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8272 - Use lgkmcnt(0) if not
8273 TgSplit execution mode
8274 and vmcnt(0) if TgSplit
8278 - s_waitcnt vmcnt(0)
8281 global/generic load/store/
8282 load atomic/store atomic/
8284 - s_waitcnt lgkmcnt(0)
8291 - Must happen before
8302 2. buffer/global/flat_atomic
8303 atomicrmw release - workgroup - local *If TgSplit execution mode,
8304 local address space cannot
8308 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
8311 - If TgSplit execution mode,
8315 - Could be split into
8324 - s_waitcnt vmcnt(0)
8331 - s_waitcnt lgkmcnt(0)
8338 - Must happen before
8349 2. buffer/global/flat_atomic
8350 atomicrmw release - system - global 1. buffer_wbl2
8352 - Must happen before
8353 following s_waitcnt.
8354 - Performs L2 writeback to
8358 visible at system scope.
8360 2. s_waitcnt lgkmcnt(0) &
8363 - If TgSplit execution mode,
8367 - Could be split into
8376 - s_waitcnt vmcnt(0)
8383 - s_waitcnt lgkmcnt(0)
8390 - Must happen before
8395 to memory and the L2
8402 3. buffer/global/flat_atomic
8403 fence release - singlethread *none* *none*
8405 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8407 - Use lgkmcnt(0) if not
8408 TgSplit execution mode
8409 and vmcnt(0) if TgSplit
8419 - However, since LLVM
8434 - s_waitcnt vmcnt(0)
8439 load atomic/store atomic/
8441 - s_waitcnt lgkmcnt(0)
8448 - Must happen before
8457 fence-paired-atomic).
8464 fence-paired-atomic.
8466 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
8469 - If TgSplit execution mode,
8479 - However, since LLVM
8494 - Could be split into
8503 - s_waitcnt vmcnt(0)
8510 - s_waitcnt lgkmcnt(0)
8517 - Must happen before
8526 fence-paired-atomic).
8533 fence-paired-atomic.
8535 fence release - system *none* 1. buffer_wbl2
8540 - Must happen before
8541 following s_waitcnt.
8542 - Performs L2 writeback to
8546 visible at system scope.
8548 2. s_waitcnt lgkmcnt(0) &
8551 - If TgSplit execution mode,
8561 - However, since LLVM
8576 - Could be split into
8585 - s_waitcnt vmcnt(0)
8592 - s_waitcnt lgkmcnt(0)
8599 - Must happen before
8608 fence-paired-atomic).
8615 fence-paired-atomic.
8617 **Acquire-Release Atomic**
8618 ------------------------------------------------------------------------------------
8619 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
8620 - wavefront - generic
8621 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
8622 - wavefront local address space cannot
8626 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8628 - Use lgkmcnt(0) if not
8629 TgSplit execution mode
8630 and vmcnt(0) if TgSplit
8640 - s_waitcnt vmcnt(0)
8643 global/generic load/store/
8644 load atomic/store atomic/
8646 - s_waitcnt lgkmcnt(0)
8653 - Must happen before
8664 2. buffer/global_atomic
8665 3. s_waitcnt vmcnt(0)
8667 - If not TgSplit execution
8669 - Must happen before
8679 4. buffer_wbinvl1_vol
8681 - If not TgSplit execution
8688 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
8689 local address space cannot
8693 2. s_waitcnt lgkmcnt(0)
8696 - Must happen before
8705 older than the local load
8709 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
8711 - Use lgkmcnt(0) if not
8712 TgSplit execution mode
8713 and vmcnt(0) if TgSplit
8717 - s_waitcnt vmcnt(0)
8720 global/generic load/store/
8721 load atomic/store atomic/
8723 - s_waitcnt lgkmcnt(0)
8730 - Must happen before
8742 3. s_waitcnt lgkmcnt(0) &
8745 - If not TgSplit execution
8746 mode, omit vmcnt(0).
8749 - Must happen before
8751 buffer_wbinvl1_vol and
8760 older than a local load
8764 3. buffer_wbinvl1_vol
8766 - If not TgSplit execution
8773 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
8776 - If TgSplit execution mode,
8780 - Could be split into
8789 - s_waitcnt vmcnt(0)
8796 - s_waitcnt lgkmcnt(0)
8803 - Must happen before
8814 2. buffer/global_atomic
8815 3. s_waitcnt vmcnt(0)
8817 - Must happen before
8826 4. buffer_wbinvl1_vol
8828 - Must happen before
8838 atomicrmw acq_rel - system - global 1. buffer_wbl2
8840 - Must happen before
8841 following s_waitcnt.
8842 - Performs L2 writeback to
8846 visible at system scope.
8848 2. s_waitcnt lgkmcnt(0) &
8851 - If TgSplit execution mode,
8855 - Could be split into
8864 - s_waitcnt vmcnt(0)
8871 - s_waitcnt lgkmcnt(0)
8878 - Must happen before
8883 to global and L2 writeback
8884 have completed before
8889 3. buffer/global_atomic
8890 4. s_waitcnt vmcnt(0)
8892 - Must happen before
8893 following buffer_invl2 and
8904 - Must happen before
8912 stale L1 global data,
8913 nor see stale L2 MTYPE
8915 MTYPE RW and CC memory will
8916 never be stale in L2 due to
8919 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
8922 - If TgSplit execution mode,
8926 - Could be split into
8935 - s_waitcnt vmcnt(0)
8942 - s_waitcnt lgkmcnt(0)
8949 - Must happen before
8961 3. s_waitcnt vmcnt(0) &
8964 - If TgSplit execution mode,
8968 - Must happen before
8977 4. buffer_wbinvl1_vol
8979 - Must happen before
8989 atomicrmw acq_rel - system - generic 1. buffer_wbl2
8991 - Must happen before
8992 following s_waitcnt.
8993 - Performs L2 writeback to
8997 visible at system scope.
8999 2. s_waitcnt lgkmcnt(0) &
9002 - If TgSplit execution mode,
9006 - Could be split into
9015 - s_waitcnt vmcnt(0)
9022 - s_waitcnt lgkmcnt(0)
9029 - Must happen before
9034 to global and L2 writeback
9035 have completed before
9041 4. s_waitcnt vmcnt(0) &
9044 - If TgSplit execution mode,
9048 - Must happen before
9049 following buffer_invl2 and
9060 - Must happen before
9068 stale L1 global data,
9069 nor see stale L2 MTYPE
9071 MTYPE RW and CC memory will
9072 never be stale in L2 due to
9075 fence acq_rel - singlethread *none* *none*
9077 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
9079 - Use lgkmcnt(0) if not
9080 TgSplit execution mode
9081 and vmcnt(0) if TgSplit
9100 - s_waitcnt vmcnt(0)
9105 load atomic/store atomic/
9107 - s_waitcnt lgkmcnt(0)
9114 - Must happen before
9137 acquire-fence-paired-atomic)
9158 release-fence-paired-atomic).
9162 - Must happen before
9166 acquire-fence-paired
9167 atomic has completed
9176 acquire-fence-paired-atomic.
9178 2. buffer_wbinvl1_vol
9180 - If not TgSplit execution
9187 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
9190 - If TgSplit execution mode,
9196 - However, since LLVM
9204 - Could be split into
9213 - s_waitcnt vmcnt(0)
9220 - s_waitcnt lgkmcnt(0)
9227 - Must happen before
9232 global/local/generic
9241 acquire-fence-paired-atomic)
9253 global/local/generic
9262 release-fence-paired-atomic).
9267 2. buffer_wbinvl1_vol
9269 - Must happen before
9283 fence acq_rel - system *none* 1. buffer_wbl2
9288 - Must happen before
9289 following s_waitcnt.
9290 - Performs L2 writeback to
9294 visible at system scope.
9296 2. s_waitcnt lgkmcnt(0) &
9299 - If TgSplit execution mode,
9305 - However, since LLVM
9313 - Could be split into
9322 - s_waitcnt vmcnt(0)
9329 - s_waitcnt lgkmcnt(0)
9336 - Must happen before
9337 the following buffer_invl2 and
9341 global/local/generic
9350 acquire-fence-paired-atomic)
9362 global/local/generic
9371 release-fence-paired-atomic).
9379 - Must happen before
9388 stale L1 global data,
9389 nor see stale L2 MTYPE
9391 MTYPE RW and CC memory will
9392 never be stale in L2 due to
9395 **Sequential Consistent Atomic**
9396 ------------------------------------------------------------------------------------
9397 load atomic seq_cst - singlethread - global *Same as corresponding
9398 - wavefront - local load atomic acquire,
9399 - generic except must generate
9400 all instructions even
9402 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
9404 - Use lgkmcnt(0) if not
9405 TgSplit execution mode
9406 and vmcnt(0) if TgSplit
9408 - s_waitcnt lgkmcnt(0) must
9421 lgkmcnt(0) and so do
9424 - s_waitcnt vmcnt(0)
9443 consistent global/local
9469 order. The s_waitcnt
9470 could be placed after
9474 make the s_waitcnt be
9481 instructions same as
9484 except must generate
9485 all instructions even
9487 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
9488 local address space cannot
9491 *Same as corresponding
9492 load atomic acquire,
9493 except must generate
9494 all instructions even
9497 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
9498 - system - generic vmcnt(0)
9500 - If TgSplit execution mode,
9502 - Could be split into
9511 - s_waitcnt lgkmcnt(0)
9524 lgkmcnt(0) and so do
9527 - s_waitcnt vmcnt(0)
9572 order. The s_waitcnt
9573 could be placed after
9577 make the s_waitcnt be
9584 instructions same as
9587 except must generate
9588 all instructions even
9590 store atomic seq_cst - singlethread - global *Same as corresponding
9591 - wavefront - local store atomic release,
9592 - workgroup - generic except must generate
9593 - agent all instructions even
9594 - system for OpenCL.*
9595 atomicrmw seq_cst - singlethread - global *Same as corresponding
9596 - wavefront - local atomicrmw acq_rel,
9597 - workgroup - generic except must generate
9598 - agent all instructions even
9599 - system for OpenCL.*
9600 fence seq_cst - singlethread *none* *Same as corresponding
9601 - wavefront fence acq_rel,
9602 - workgroup except must generate
9603 - agent all instructions even
9604 - system for OpenCL.*
9605 ============ ============ ============== ========== ================================
9607 .. _amdgpu-amdhsa-memory-model-gfx942:
9614 * Each agent has multiple shader arrays (SA).
9615 * Each SA has multiple compute units (CU).
9616 * Each CU has multiple SIMDs that execute wavefronts.
9617 * The wavefronts for a single work-group are executed in the same CU but may be
9618 executed by different SIMDs. The exception is tgsplit execution mode, in which
9619 the wavefronts may be executed by different SIMDs in different CUs.
9620 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
9621 executing on it. The exception is tgsplit execution mode, in which no LDS is
9622 allocated since wavefronts of the same work-group can be in different CUs.
9623 * All LDS operations of a CU are performed as wavefront wide operations in a
9624 global order and involve no caching. Completion is reported to a wavefront in execution order.
9626 * The LDS memory has multiple request queues shared by the SIMDs of a
9627 CU. Therefore, the LDS operations performed by different wavefronts of a
9628 work-group can be reordered relative to each other, which can result in
9629 reordering the visibility of vector memory operations with respect to LDS
9630 operations of other wavefronts in the same work-group. A ``s_waitcnt
9631 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
9632 vector memory operations between wavefronts of a work-group, but not between
9633 operations performed by the same wavefront.
9634 * The vector memory operations are performed as wavefront wide operations and
9635 completion is reported to a wavefront in execution order. The exception is
9636 that ``flat_load/store/atomic`` instructions can report out of vector memory
9637 order if they access LDS memory, and out of LDS operation order if they access global memory.
9639 * The vector memory operations access a single vector L1 cache shared by all
9640 SIMDs of a CU. Therefore:
9642 * No special action is required for coherence between the lanes of a single wavefront.
9645 * No special action is required for coherence between wavefronts in the same
9646 work-group since they execute on the same CU. The exception is when in
9647 tgsplit execution mode as wavefronts of the same work-group can be in
9648 different CUs and so a ``buffer_inv sc0`` is required which will invalidate the L1 cache.
9651 * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
9652 between wavefronts executing in different work-groups as they may be
9653 executing on different CUs.
9655 * Atomic read-modify-write instructions implicitly bypass the L1 cache.
9656 Therefore, they do not use the sc0 bit for coherence and instead use it to
9657 indicate if the instruction returns the original value being updated. They
9658 do use sc1 to indicate system or agent scope coherence.
9660 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
9661 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
9662 scalar operations are used in a restricted way so do not impact the memory
9663 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
9664 * The vector and scalar memory operations use an L2 cache.
9666 * The gfx942 can be configured as a number of smaller agents with each having
9667 a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
9668 larger agents with groups of CUs on each agent, each sharing separate L2 caches.
9670 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
9672 * Each CU has a separate request queue per channel for its associated L2.
9673 Therefore, the vector and scalar memory operations performed by wavefronts
9674 executing with different L1 caches and the same L2 cache can be reordered
9675 relative to each other.
9676 * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
9677 vector memory operations of different CUs. It ensures a previous vector
9678 memory operation has completed before executing a subsequent vector memory
9679 or LDS operation and so can be used to meet the requirements of acquire and release.
9681 * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
9682 (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
9683 the PTE C-bit set for memory not local to the L2.
9685 * Any local memory cache lines will be automatically invalidated by writes
9686 from CUs associated with other L2 caches, or writes from the CPU, due to
9687 the cache probe caused by the PTE C-bit.
9688 * XGMI accesses from the CPU to local memory may be cached on the CPU.
9689 Subsequent access from the GPU will automatically invalidate or writeback
9690 the CPU cache due to the L2 probe filter.
9691 * To ensure coherence of local memory writes of CUs with different L1 caches
9692 in the same agent a ``buffer_wbl2`` is required. It does nothing if the
9693 agent is configured to have a single L2, or will writeback dirty L2 cache
9694 lines if configured to have multiple L2 caches.
9695 * To ensure coherence of local memory writes of CUs in different agents a
9696 ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
9697 * To ensure coherence of local memory reads of CUs with different L1 caches
9698 in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
9699 agent is configured to have a single L2, or will invalidate non-local L2
9700 cache lines if configured to have multiple L2 caches.
9701 * To ensure coherence of local memory reads of CUs in different agents a
9702 ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
9703 lines if configured to have multiple L2 caches.
9705 * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
9706 UC (uncached) which bypasses the L2.
9708 Scalar memory operations are only used to access memory that is proven to not
9709 change during the execution of the kernel dispatch. This includes constant
9710 address space and global address space for program scope ``const`` variables.
9711 Therefore, the kernel machine code does not have to maintain the scalar cache to
9712 ensure it is coherent with the vector caches. The scalar and vector caches are
9713 invalidated between kernel dispatches by CP since constant address space data
9714 may change between kernel dispatch executions. See
9715 :ref:`amdgpu-amdhsa-memory-spaces`.
9717 The one exception is if scalar writes are used to spill SGPR registers. In this
9718 case the AMDGPU backend ensures the memory location used to spill is never
9719 accessed by vector memory operations at the same time. If scalar writes are used
9720 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
9721 return since the locations may be used for vector memory instructions by a
9722 future wavefront that uses the same scratch area, or a function call that
9723 creates a frame at the same address, respectively. There is no need for a
9724 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
9726 For kernarg backing memory:
9728 * CP invalidates the L1 cache at the start of each kernel dispatch.
9729 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
9730 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
9731 cache. This also causes it to be treated as non-volatile and so is not
9732 invalidated by ``*_vol``.
9733 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
9734 so the L2 cache will be coherent with the CPU and other agents.
9736 Scratch backing memory (which is used for the private address space) is accessed
9737 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
9738 only accessed by a single thread, and is always write-before-read, there is
9739 never a need to invalidate these entries from the L1 cache. Hence all cache
9740 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
9742 The code sequences used to implement the memory model for GFX940, GFX941, and GFX942
9743 are defined in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table`.
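
As an illustration of the ``sc0``/``sc1`` cache-policy bits used throughout this
table, the system-scope global ``load atomic acquire`` and ``store atomic release``
rows correspond to sequences like the sketch below. Register choices are
hypothetical, the bits are written as assembler modifiers rather than the table's
``sc0=1 sc1=1`` notation, and the TgSplit and OpenCL variations in the notes are
omitted.

.. code-block:: none

   // Sketch: load atomic acquire, system scope, global address.
   global_load_dword v0, v[2:3], off sc0 sc1  // 1. the acquiring load (sc0=1 sc1=1)
   s_waitcnt vmcnt(0)                         // 2. wait for the load to complete
   buffer_inv sc0 sc1                         // 3. invalidate L1 and stale MTYPE NC L2 lines

   // Sketch: store atomic release, system scope, global address.
   buffer_wbl2 sc0 sc1                        // 1. write back dirty L2 lines
   s_waitcnt vmcnt(0) lgkmcnt(0)              // 2. wait for prior accesses and the writeback
   global_store_dword v[2:3], v1, off sc0 sc1 // 3. the releasing store
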
9745 .. table:: AMDHSA Memory Model Code Sequences GFX940, GFX941, GFX942
9746 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table
9748 ============ ============ ============== ========== ================================
9749 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
9750 Ordering Sync Scope Address GFX940, GFX941, GFX942
9752 ============ ============ ============== ========== ================================
9754 ------------------------------------------------------------------------------------
9755 load *none* *none* - global - !volatile & !nontemporal
9757 - private 1. buffer/global/flat_load
9759 - !volatile & nontemporal
9761 1. buffer/global/flat_load
9766 1. buffer/global/flat_load
9768 2. s_waitcnt vmcnt(0)
9770 - Must happen before
9771 any following volatile
9782 load *none* *none* - local 1. ds_load
9783 store *none* *none* - global - !volatile & !nontemporal
9785 - private 1. GFX940, GFX941
9786 - constant buffer/global/flat_store
9789 buffer/global/flat_store
9791 - !volatile & nontemporal
9794 buffer/global/flat_store
9797 buffer/global/flat_store
9802 1. buffer/global/flat_store
9804 2. s_waitcnt vmcnt(0)
9806 - Must happen before
9807 any following volatile
9818 store *none* *none* - local 1. ds_store
9819 **Unordered Atomic**
9820 ------------------------------------------------------------------------------------
9821 load atomic unordered *any* *any* *Same as non-atomic*.
9822 store atomic unordered *any* *any* *Same as non-atomic*.
9823 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
9824 **Monotonic Atomic**
9825 ------------------------------------------------------------------------------------
9826 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
9827 - wavefront - generic
9828 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
9830 load atomic monotonic - singlethread - local *If TgSplit execution mode,
9831 - wavefront local address space cannot
9832 - workgroup be used.*
9835 load atomic monotonic - agent - global 1. buffer/global/flat_load
9837 load atomic monotonic - system - global 1. buffer/global/flat_load
9838 - generic sc0=1 sc1=1
9839 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
9840 - wavefront - generic
9841 store atomic monotonic - workgroup - global 1. buffer/global/flat_store
9843 store atomic monotonic - agent - global 1. buffer/global/flat_store
9845 store atomic monotonic - system - global 1. buffer/global/flat_store
9846 - generic sc0=1 sc1=1
9847 store atomic monotonic - singlethread - local *If TgSplit execution mode,
9848 - wavefront local address space cannot
9849 - workgroup be used.*
9852 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
9853 - wavefront - generic
9856 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
9858 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
9859 - wavefront local address space cannot
9860 - workgroup be used.*
9864 ------------------------------------------------------------------------------------
9865 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
9868 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1
9869 2. s_waitcnt vmcnt(0)
9871 - If not TgSplit execution
9873 - Must happen before the
9874 following buffer_inv.
9878 - If not TgSplit execution
9880 - Must happen before
9891 load atomic acquire - workgroup - local *If TgSplit execution mode,
9892 local address space cannot
9896 2. s_waitcnt lgkmcnt(0)
9899 - Must happen before
9908 older than the local load
9912 load atomic acquire - workgroup - generic 1. flat_load sc0=1
9913 2. s_waitcnt lgkm/vmcnt(0)
9915 - Use lgkmcnt(0) if not
9916 TgSplit execution mode
9917 and vmcnt(0) if TgSplit
9919 - If OpenCL, omit lgkmcnt(0).
9920 - Must happen before
9923 following global/generic
9930 older than a local load
9936 - If not TgSplit execution
9943 load atomic acquire - agent - global 1. buffer/global_load
9945 2. s_waitcnt vmcnt(0)
9947 - Must happen before
9957 - Must happen before
9967 load atomic acquire - system - global 1. buffer/global/flat_load
9969 2. s_waitcnt vmcnt(0)
9971 - Must happen before
9979 3. buffer_inv sc0=1 sc1=1
9981 - Must happen before
9989 stale MTYPE NC global data.
9990 MTYPE RW and CC memory will
9991 never be stale due to the
9994 load atomic acquire - agent - generic 1. flat_load sc1=1
9995 2. s_waitcnt vmcnt(0) &
9998 - If TgSplit execution mode,
10002 - Must happen before
10005 - Ensures the flat_load
10007 before invalidating
10010 3. buffer_inv sc1=1
10012 - Must happen before
10022 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1
10023 2. s_waitcnt vmcnt(0) &
10026 - If TgSplit execution mode,
10030 - Must happen before
10033 - Ensures the flat_load
10035 before invalidating
10038 3. buffer_inv sc0=1 sc1=1
10040 - Must happen before
10048 stale MTYPE NC global data.
10049 MTYPE RW and CC memory will
10050 never be stale due to the
10053 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
10054 - wavefront - generic
10055 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
10056 - wavefront local address space cannot
10060 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
10061 2. s_waitcnt vmcnt(0)
10063 - If not TgSplit execution
10065 - Must happen before the
10066 following buffer_inv.
10067 - Ensures the atomicrmw
10069 before invalidating
10072 3. buffer_inv sc0=1
10074 - If not TgSplit execution
10076 - Must happen before
10086 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
10087 local address space cannot
10091 2. s_waitcnt lgkmcnt(0)
10094 - Must happen before
10103 older than the local
10107 atomicrmw acquire - workgroup - generic 1. flat_atomic
10108 2. s_waitcnt lgkm/vmcnt(0)
10110 - Use lgkmcnt(0) if not
10111 TgSplit execution mode
10112 and vmcnt(0) if TgSplit
10114 - If OpenCL, omit lgkmcnt(0).
10115 - Must happen before
10130 3. buffer_inv sc0=1
10132 - If not TgSplit execution
10139 atomicrmw acquire - agent - global 1. buffer/global_atomic
10140 2. s_waitcnt vmcnt(0)
10142 - Must happen before
10151 3. buffer_inv sc1=1
10153 - Must happen before
10163 atomicrmw acquire - system - global 1. buffer/global_atomic
10165 2. s_waitcnt vmcnt(0)
10167 - Must happen before
10176 3. buffer_inv sc0=1 sc1=1
10178 - Must happen before
10186 stale MTYPE NC global data.
10187 MTYPE RW and CC memory will
10188 never be stale due to the
10191 atomicrmw acquire - agent - generic 1. flat_atomic
10192 2. s_waitcnt vmcnt(0) &
10195 - If TgSplit execution mode,
10199 - Must happen before
10208 3. buffer_inv sc1=1
10210 - Must happen before
10220 atomicrmw acquire - system - generic 1. flat_atomic sc1=1
10221 2. s_waitcnt vmcnt(0) &
10224 - If TgSplit execution mode,
10228 - Must happen before
10237 3. buffer_inv sc0=1 sc1=1
10239 - Must happen before
10247 stale MTYPE NC global data.
10248 MTYPE RW and CC memory will
10249 never be stale due to the
10252 fence acquire - singlethread *none* *none*
10254 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10256 - Use lgkmcnt(0) if not
10257 TgSplit execution mode
10258 and vmcnt(0) if TgSplit
10268 - However, since LLVM
10273 always generate. If
10283 - s_waitcnt vmcnt(0)
10286 global/generic load
10291 and memory ordering
10295 fence-paired-atomic).
10296 - s_waitcnt lgkmcnt(0)
10303 and memory ordering
10307 fence-paired-atomic).
10308 - Must happen before
10321 fence-paired-atomic.
10323 3. buffer_inv sc0=1
10325 - If not TgSplit execution
10332 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
10335 - If TgSplit execution mode,
10341 - However, since LLVM
10349 - Could be split into
10353 lgkmcnt(0) to allow
10355 independently moved
10358 - s_waitcnt vmcnt(0)
10361 global/generic load
10365 and memory ordering
10369 fence-paired-atomic).
10370 - s_waitcnt lgkmcnt(0)
10377 and memory ordering
10381 fence-paired-atomic).
10382 - Must happen before
10386 fence-paired atomic
10388 before invalidating
10392 locations read must
10396 fence-paired-atomic.
10398 2. buffer_inv sc1=1
10400 - Must happen before any
10401 following global/generic
10410 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
10413 - If TgSplit execution mode,
10419 - However, since LLVM
10427 - Could be split into
10431 lgkmcnt(0) to allow
10433 independently moved
10436 - s_waitcnt vmcnt(0)
10439 global/generic load
10443 and memory ordering
10447 fence-paired-atomic).
10448 - s_waitcnt lgkmcnt(0)
10455 and memory ordering
10459 fence-paired-atomic).
10460 - Must happen before
10464 fence-paired atomic
10466 before invalidating
10470 locations read must
10474 fence-paired-atomic.
10476 2. buffer_inv sc0=1 sc1=1
10478 - Must happen before any
10479 following global/generic
10489 ------------------------------------------------------------------------------------
10490 store atomic release - singlethread - global 1. GFX940, GFX941
10491 - wavefront - generic buffer/global/flat_store
10494 buffer/global/flat_store
10496 store atomic release - singlethread - local *If TgSplit execution mode,
10497 - wavefront local address space cannot
10501 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10503 - Use lgkmcnt(0) if not
10504 TgSplit execution mode
10505 and vmcnt(0) if TgSplit
10507 - If OpenCL, omit lgkmcnt(0).
10508 - s_waitcnt vmcnt(0)
10511 global/generic load/store/
10512 load atomic/store atomic/
10514 - s_waitcnt lgkmcnt(0)
10521 - Must happen before
10529 store that is being
10533 buffer/global/flat_store
10536 buffer/global/flat_store
10538 store atomic release - workgroup - local *If TgSplit execution mode,
10539 local address space cannot
10543 store atomic release - agent - global 1. buffer_wbl2 sc1=1
10545 - Must happen before
10546 following s_waitcnt.
10547 - Performs L2 writeback to
10550 store/atomicrmw are
10551 visible at agent scope.
10553 2. s_waitcnt lgkmcnt(0) &
10556 - If TgSplit execution mode,
10562 - Could be split into
10566 lgkmcnt(0) to allow
10568 independently moved
10571 - s_waitcnt vmcnt(0)
10578 - s_waitcnt lgkmcnt(0)
10585 - Must happen before
10593 store that is being
10597 buffer/global/flat_store
10600 buffer/global/flat_store
10602 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10604 - Must happen before
10605 following s_waitcnt.
10606 - Performs L2 writeback to
10609 store/atomicrmw are
10610 visible at system scope.
10612 2. s_waitcnt lgkmcnt(0) &
10615 - If TgSplit execution mode,
10621 - Could be split into
10625 lgkmcnt(0) to allow
10627 independently moved
10630 - s_waitcnt vmcnt(0)
10631 must happen after any
10637 - s_waitcnt lgkmcnt(0)
10638 must happen after any
10644 - Must happen before
10649 to memory and the L2
10653 store that is being
10656 3. buffer/global/flat_store
10658 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
10659 - wavefront - generic
10660 atomicrmw release - singlethread - local *If TgSplit execution mode,
10661 - wavefront local address space cannot
10665 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10667 - Use lgkmcnt(0) if not
10668 TgSplit execution mode
10669 and vmcnt(0) if TgSplit
10673 - s_waitcnt vmcnt(0)
10676 global/generic load/store/
10677 load atomic/store atomic/
10679 - s_waitcnt lgkmcnt(0)
10686 - Must happen before
10697 2. buffer/global/flat_atomic sc0=1
10698 atomicrmw release - workgroup - local *If TgSplit execution mode,
10699 local address space cannot
10703 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1
10705 - Must happen before
10706 following s_waitcnt.
10707 - Performs L2 writeback to
10710 store/atomicrmw are
10711 visible at agent scope.
10713 2. s_waitcnt lgkmcnt(0) &
10716 - If TgSplit execution mode,
10720 - Could be split into
10724 lgkmcnt(0) to allow
10726 independently moved
10729 - s_waitcnt vmcnt(0)
10736 - s_waitcnt lgkmcnt(0)
10743 - Must happen before
10748 to global and local
10754 3. buffer/global/flat_atomic sc1=1
10755 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10757 - Must happen before
10758 following s_waitcnt.
10759 - Performs L2 writeback to
10762 store/atomicrmw are
10763 visible at system scope.
10765 2. s_waitcnt lgkmcnt(0) &
10768 - If TgSplit execution mode,
10772 - Could be split into
10776 lgkmcnt(0) to allow
10778 independently moved
10781 - s_waitcnt vmcnt(0)
10788 - s_waitcnt lgkmcnt(0)
10795 - Must happen before
10800 to memory and the L2
10804 store that is being
10807 3. buffer/global/flat_atomic
10809 fence release - singlethread *none* *none*
10811 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10813 - Use lgkmcnt(0) if not
10814 TgSplit execution mode
10815 and vmcnt(0) if TgSplit
10825 - However, since LLVM
10830 always generate. If
10840 - s_waitcnt vmcnt(0)
10845 load atomic/store atomic/
10847 - s_waitcnt lgkmcnt(0)
10854 - Must happen before
10855 any following store
10859 and memory ordering
10863 fence-paired-atomic).
10870 fence-paired-atomic.
10872 fence release - agent *none* 1. buffer_wbl2 sc1=1
10877 - Must happen before
10878 following s_waitcnt.
10879 - Performs L2 writeback to
10882 store/atomicrmw are
10883 visible at agent scope.
10885 2. s_waitcnt lgkmcnt(0) &
10888 - If TgSplit execution mode,
10898 - However, since LLVM
10903 always generate. If
10913 - Could be split into
10917 lgkmcnt(0) to allow
10919 independently moved
10922 - s_waitcnt vmcnt(0)
10929 - s_waitcnt lgkmcnt(0)
10936 - Must happen before
10937 any following store
10941 and memory ordering
10945 fence-paired-atomic).
10952 fence-paired-atomic.
10954 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10956 - Must happen before
10957 following s_waitcnt.
10958 - Performs L2 writeback to
10961 store/atomicrmw are
10962 visible at system scope.
10964 2. s_waitcnt lgkmcnt(0) &
10967 - If TgSplit execution mode,
10977 - However, since LLVM
10982 always generate. If
10992 - Could be split into
10996 lgkmcnt(0) to allow
10998 independently moved
11001 - s_waitcnt vmcnt(0)
11008 - s_waitcnt lgkmcnt(0)
11015 - Must happen before
11016 any following store
11020 and memory ordering
11024 fence-paired-atomic).
11031 fence-paired-atomic.
11033 **Acquire-Release Atomic**
11034 ------------------------------------------------------------------------------------
11035 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
11036 - wavefront - generic
11037 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
11038 - wavefront local address space cannot
11042 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
11044 - Use lgkmcnt(0) if not
11045 TgSplit execution mode
11046 and vmcnt(0) if TgSplit
11050 - Must happen after
11056 - s_waitcnt vmcnt(0)
11059 global/generic load/store/
11060 load atomic/store atomic/
11062 - s_waitcnt lgkmcnt(0)
11069 - Must happen before
11080 2. buffer/global_atomic
11081 3. s_waitcnt vmcnt(0)
11083 - If not TgSplit execution
11085 - Must happen before
11095 4. buffer_inv sc0=1
11097 - If not TgSplit execution
11104 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
11105 local address space cannot
11109 2. s_waitcnt lgkmcnt(0)
11112 - Must happen before
11121 older than the local load
11125 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
11127 - Use lgkmcnt(0) if not
11128 TgSplit execution mode
11129 and vmcnt(0) if TgSplit
11133 - s_waitcnt vmcnt(0)
11136 global/generic load/store/
11137 load atomic/store atomic/
11139 - s_waitcnt lgkmcnt(0)
11146 - Must happen before
11158 3. s_waitcnt lgkmcnt(0) &
11161 - If not TgSplit execution
11162 mode, omit vmcnt(0).
11165 - Must happen before
11176 older than a local load
11180 3. buffer_inv sc0=1
11182 - If not TgSplit execution
11189 atomicrmw acq_rel - agent - global 1. buffer_wbl2 sc1=1
11191 - Must happen before
11192 following s_waitcnt.
11193 - Performs L2 writeback to
11196 store/atomicrmw are
11197 visible at agent scope.
11199 2. s_waitcnt lgkmcnt(0) &
11202 - If TgSplit execution mode,
11206 - Could be split into
11210 lgkmcnt(0) to allow
11212 independently moved
11215 - s_waitcnt vmcnt(0)
11222 - s_waitcnt lgkmcnt(0)
11229 - Must happen before
11240 3. buffer/global_atomic
11241 4. s_waitcnt vmcnt(0)
11243 - Must happen before
11252 5. buffer_inv sc1=1
11254 - Must happen before
11264 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1
11266 - Must happen before
11267 following s_waitcnt.
11268 - Performs L2 writeback to
11271 store/atomicrmw are
11272 visible at system scope.
11274 2. s_waitcnt lgkmcnt(0) &
11277 - If TgSplit execution mode,
11281 - Could be split into
11285 lgkmcnt(0) to allow
11287 independently moved
11290 - s_waitcnt vmcnt(0)
11297 - s_waitcnt lgkmcnt(0)
11304 - Must happen before
11309 to global and L2 writeback
11310 have completed before
11315 3. buffer/global_atomic
11317 4. s_waitcnt vmcnt(0)
11319 - Must happen before
11328 5. buffer_inv sc0=1 sc1=1
11330 - Must happen before
11338 MTYPE NC global data.
11339 MTYPE RW and CC memory will
11340 never be stale due to the
11343 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1
11345 - Must happen before
11346 following s_waitcnt.
11347 - Performs L2 writeback to
11350 store/atomicrmw are
11351 visible at agent scope.
11353 2. s_waitcnt lgkmcnt(0) &
11356 - If TgSplit execution mode,
11360 - Could be split into
11364 lgkmcnt(0) to allow
11366 independently moved
11369 - s_waitcnt vmcnt(0)
11376 - s_waitcnt lgkmcnt(0)
11383 - Must happen before
11395 4. s_waitcnt vmcnt(0) &
11398 - If TgSplit execution mode,
11402 - Must happen before
11411 5. buffer_inv sc1=1
11413 - Must happen before
11423 atomicrmw acq_rel - system - generic 1. buffer_wbl2 sc0=1 sc1=1
11425 - Must happen before
11426 following s_waitcnt.
11427 - Performs L2 writeback to
11430 store/atomicrmw are
11431 visible at system scope.
11433 2. s_waitcnt lgkmcnt(0) &
11436 - If TgSplit execution mode,
11440 - Could be split into
11444 lgkmcnt(0) to allow
11446 independently moved
11449 - s_waitcnt vmcnt(0)
11456 - s_waitcnt lgkmcnt(0)
11463 - Must happen before
11468 to global and L2 writeback
11469 have completed before
11474 3. flat_atomic sc1=1
11475 4. s_waitcnt vmcnt(0) &
11478 - If TgSplit execution mode,
11482 - Must happen before
11491 5. buffer_inv sc0=1 sc1=1
11493 - Must happen before
11501 MTYPE NC global data.
11502 MTYPE RW and CC memory will
11503 never be stale due to the
11506 fence acq_rel - singlethread *none* *none*
11508 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
11510 - Use lgkmcnt(0) if not
11511 TgSplit execution mode
11512 and vmcnt(0) if TgSplit
11531 - s_waitcnt vmcnt(0)
11536 load atomic/store atomic/
11538 - s_waitcnt lgkmcnt(0)
11545 - Must happen before
11564 and memory ordering
11568 acquire-fence-paired-atomic)
11581 local/generic store
11585 and memory ordering
11589 release-fence-paired-atomic).
11593 - Must happen before
11597 acquire-fence-paired
11598 atomic has completed
11599 before invalidating
11603 locations read must
11607 acquire-fence-paired-atomic.
11609 3. buffer_inv sc0=1
11611 - If not TgSplit execution
11618 fence acq_rel - agent *none* 1. buffer_wbl2 sc1=1
11623 - Must happen before
11624 following s_waitcnt.
11625 - Performs L2 writeback to
11628 store/atomicrmw are
11629 visible at agent scope.
11631 2. s_waitcnt lgkmcnt(0) &
11634 - If TgSplit execution mode,
11640 - However, since LLVM
11648 - Could be split into
11652 lgkmcnt(0) to allow
11654 independently moved
11657 - s_waitcnt vmcnt(0)
11664 - s_waitcnt lgkmcnt(0)
11671 - Must happen before
11676 global/local/generic
11681 and memory ordering
11685 acquire-fence-paired-atomic)
11687 before invalidating
11697 global/local/generic
11702 and memory ordering
11706 release-fence-paired-atomic).
11711 3. buffer_inv sc1=1
11713 - Must happen before
11727 fence acq_rel - system *none* 1. buffer_wbl2 sc0=1 sc1=1
11732 - Must happen before
11733 following s_waitcnt.
11734 - Performs L2 writeback to
11737 store/atomicrmw are
11738 visible at system scope.
2. s_waitcnt lgkmcnt(0) &
11743 - If TgSplit execution mode,
11749 - However, since LLVM
11757 - Could be split into
11761 lgkmcnt(0) to allow
11763 independently moved
11766 - s_waitcnt vmcnt(0)
11773 - s_waitcnt lgkmcnt(0)
11780 - Must happen before
11785 global/local/generic
11790 and memory ordering
11794 acquire-fence-paired-atomic)
11796 before invalidating
11806 global/local/generic
11811 and memory ordering
11815 release-fence-paired-atomic).
3. buffer_inv sc0=1 sc1=1
11822 - Must happen before
11831 MTYPE NC global data.
11832 MTYPE RW and CC memory will
11833 never be stale due to the
11836 **Sequential Consistent Atomic**
11837 ------------------------------------------------------------------------------------
11838 load atomic seq_cst - singlethread - global *Same as corresponding
11839 - wavefront - local load atomic acquire,
11840 - generic except must generate
11841 all instructions even
11843 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
11845 - Use lgkmcnt(0) if not
11846 TgSplit execution mode
11847 and vmcnt(0) if TgSplit
11849 - s_waitcnt lgkmcnt(0) must
11856 ordering of seq_cst
11862 lgkmcnt(0) and so do
11865 - s_waitcnt vmcnt(0)
11868 global/generic load
11872 ordering of seq_cst
11884 consistent global/local
11885 memory instructions
11891 prevents reordering
11894 seq_cst load. (Note
11900 followed by a store
11907 release followed by
11910 order. The s_waitcnt
11911 could be placed after
11912 seq_store or before
11915 make the s_waitcnt be
11916 as late as possible
11922 instructions same as
11925 except must generate
11926 all instructions even
11928 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
11929 local address space cannot
11932 *Same as corresponding
11933 load atomic acquire,
11934 except must generate
11935 all instructions even
11938 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
11939 - system - generic vmcnt(0)
11941 - If TgSplit execution mode,
11943 - Could be split into
11947 lgkmcnt(0) to allow
11949 independently moved
11952 - s_waitcnt lgkmcnt(0)
11955 global/generic load
11959 ordering of seq_cst
11965 lgkmcnt(0) and so do
11968 - s_waitcnt vmcnt(0)
11971 global/generic load
11975 ordering of seq_cst
11988 memory instructions
11994 prevents reordering
11997 seq_cst load. (Note
12003 followed by a store
12010 release followed by
12013 order. The s_waitcnt
12014 could be placed after
12015 seq_store or before
12018 make the s_waitcnt be
12019 as late as possible
12025 instructions same as
12028 except must generate
12029 all instructions even
12031 store atomic seq_cst - singlethread - global *Same as corresponding
12032 - wavefront - local store atomic release,
12033 - workgroup - generic except must generate
12034 - agent all instructions even
12035 - system for OpenCL.*
12036 atomicrmw seq_cst - singlethread - global *Same as corresponding
12037 - wavefront - local atomicrmw acq_rel,
12038 - workgroup - generic except must generate
12039 - agent all instructions even
12040 - system for OpenCL.*
12041 fence seq_cst - singlethread *none* *Same as corresponding
12042 - wavefront fence acq_rel,
12043 - workgroup except must generate
12044 - agent all instructions even
12045 - system for OpenCL.*
12046 ============ ============ ============== ========== ================================
12048 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
12050 Memory Model GFX10-GFX11
12051 ++++++++++++++++++++++++
12055 * Each agent has multiple shader arrays (SA).
12056 * Each SA has multiple work-group processors (WGP).
12057 * Each WGP has multiple compute units (CU).
12058 * Each CU has multiple SIMDs that execute wavefronts.
12059 * The wavefronts for a single work-group are executed in the same
12060 WGP. In CU wavefront execution mode the wavefronts may be executed by
12061 different SIMDs in the same CU. In WGP wavefront execution mode the
wavefronts may be executed by different SIMDs in different CUs in the same WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
12066 * All LDS operations of a WGP are performed as wavefront wide operations in a
global order and involve no caching. Completion is reported to a wavefront in
execution order.
12069 * The LDS memory has multiple request queues shared by the SIMDs of a
12070 WGP. Therefore, the LDS operations performed by different wavefronts of a
12071 work-group can be reordered relative to each other, which can result in
12072 reordering the visibility of vector memory operations with respect to LDS
12073 operations of other wavefronts in the same work-group. A ``s_waitcnt
12074 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
12075 vector memory operations between wavefronts of a work-group, but not between
12076 operations performed by the same wavefront.
12077 * The vector memory operations are performed as wavefront wide operations.
12078 Completion of load/store/sample operations are reported to a wavefront in
execution order of other load/store/sample operations performed by that
wavefront.
12081 * The vector memory operations access a vector L0 cache. There is a single L0
12082 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
12083 special action is required for coherence between the lanes of a single
12084 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
12085 wavefronts executing in the same work-group as they may be executing on SIMDs
12086 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
12087 required for coherence between wavefronts executing in different work-groups
12088 as they may be executing on different WGPs.
12089 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
12090 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
12091 operations are used in a restricted way so do not impact the memory model. See
12092 :ref:`amdgpu-amdhsa-memory-spaces`.
12093 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
12094 the same SA. Therefore, no special action is required for coherence between
12095 the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
12096 required for coherence between wavefronts executing in different work-groups
12097 as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
12100 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
12101 vector and scalar memory operations performed by different wavefronts, whether
12102 executing in the same or different work-groups (which may be executing on
12103 different CUs accessing different L0s), can be reordered relative to each
12104 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
12105 synchronization between vector memory operations of different wavefronts. It
12106 ensures a previous vector memory operation has completed before executing a
12107 subsequent vector memory or LDS operation and so can be used to meet the
12108 requirements of acquire, release and sequential consistency.
12109 * The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
12112 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
12113 quadrant has a separate request queue per L2 channel. Therefore, the vector
12114 and scalar memory operations performed by wavefronts executing in different
12115 work-groups (which may be executing on different SAs) of an agent can be
12116 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
12117 required to ensure synchronization between vector memory operations of
12118 different SAs. It ensures a previous vector memory operation has completed
12119 before executing a subsequent vector memory and so can be used to meet the
12120 requirements of acquire, release and sequential consistency.
12121 * The L2 cache can be kept coherent with other agents on some targets, or ranges
12122 of virtual addresses can be set up to bypass it to ensure system coherence.
12123 * On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
12124 The MALL cache is fully coherent with GPU memory and has no impact on system
12125 coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
12127 Scalar memory operations are only used to access memory that is proven to not
12128 change during the execution of the kernel dispatch. This includes constant
12129 address space and global address space for program scope ``const`` variables.
12130 Therefore, the kernel machine code does not have to maintain the scalar cache to
12131 ensure it is coherent with the vector caches. The scalar and vector caches are
12132 invalidated between kernel dispatches by CP since constant address space data
12133 may change between kernel dispatch executions. See
12134 :ref:`amdgpu-amdhsa-memory-spaces`.
12136 The one exception is if scalar writes are used to spill SGPR registers. In this
12137 case the AMDGPU backend ensures the memory location used to spill is never
12138 accessed by vector memory operations at the same time. If scalar writes are used
12139 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
12140 return since the locations may be used for vector memory instructions by a
12141 future wavefront that uses the same scratch area, or a function call that
12142 creates a frame at the same address, respectively. There is no need for a
12143 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
12145 For kernarg backing memory:
12147 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
12148 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
12149 needing to invalidate the L2 cache.
12150 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
12151 so the L2 cache will be coherent with the CPU and other agents.
12153 Scratch backing memory (which is used for the private address space) is accessed
12154 with MTYPE NC (non-coherent). Since the private address space is only accessed
12155 by a single thread, and is always write-before-read, there is never a need to
12156 invalidate these entries from the L0 or L1 caches.
12158 Wavefronts are executed in native mode with in-order reporting of loads and
12159 sample instructions. In this mode vmcnt reports completion of load, atomic with
12160 return and sample instructions in order, and the vscnt reports the completion of
12161 store and atomic without return in order. See ``MEM_ORDERED`` field in
12162 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
12164 Wavefronts can be executed in WGP or CU wavefront execution mode:
12166 * In WGP wavefront execution mode the wavefronts of a work-group are executed
12167 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
12168 CU L0 caches is required for work-group synchronization. Also accesses to L1
12169 at work-group scope need to be explicitly ordered as the accesses from
12170 different CUs are not ordered.
12171 * In CU wavefront execution mode the wavefronts of a work-group are executed on
12172 the SIMDs of a single CU of the WGP. Therefore, all global memory access by
12173 the work-group access the same L0 which in turn ensures L1 accesses are
12174 ordered and so do not require explicit management of the caches for
12175 work-group synchronization.
12177 See ``WGP_MODE`` field in
12178 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
12179 :ref:`amdgpu-target-features`.
12181 The code sequences used to implement the memory model for GFX10-GFX11 are defined in
12182 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
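
As an illustration of how the rows below are read, the following C11 function
performs an atomic acquire load. Depending on the synchronization scope and
address space chosen by the language implementation, the table calls for a
load with the appropriate ``glc``/``dlc`` bits followed by ``s_waitcnt`` and
cache invalidation; this is a sketch of the mapping only, not a statement of
the exact code any particular compiler emits.

.. code-block:: c

   /* Minimal sketch: a C11 acquire load of the kind described by the
    * "load atomic acquire" rows in the table below. The scope, the address
    * space and therefore the exact GFX10-GFX11 sequence depend on the
    * language runtime and on the pointer being accessed. */
   #include <stdatomic.h>

   int load_flag_acquire(_Atomic int *flag) {
     return atomic_load_explicit(flag, memory_order_acquire);
   }
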
12184 .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
12185 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
12187 ============ ============ ============== ========== ================================
12188 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
12189 Ordering Sync Scope Address GFX10-GFX11
12191 ============ ============ ============== ========== ================================
**Non-Atomic**
------------------------------------------------------------------------------------
12194 load *none* *none* - global - !volatile & !nontemporal
12196 - private 1. buffer/global/flat_load
12198 - !volatile & nontemporal
12200 1. buffer/global/flat_load
12203 - If GFX10, omit dlc=1.
12207 1. buffer/global/flat_load
12210 2. s_waitcnt vmcnt(0)
12212 - Must happen before
12213 any following volatile
12224 load *none* *none* - local 1. ds_load
12225 store *none* *none* - global - !volatile & !nontemporal
12227 - private 1. buffer/global/flat_store
12229 - !volatile & nontemporal
12231 1. buffer/global/flat_store
12234 - If GFX10, omit dlc=1.
12238 1. buffer/global/flat_store
12241 - If GFX10, omit dlc=1.
12243 2. s_waitcnt vscnt(0)
12245 - Must happen before
12246 any following volatile
12257 store *none* *none* - local 1. ds_store
12258 **Unordered Atomic**
12259 ------------------------------------------------------------------------------------
12260 load atomic unordered *any* *any* *Same as non-atomic*.
12261 store atomic unordered *any* *any* *Same as non-atomic*.
12262 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
12263 **Monotonic Atomic**
12264 ------------------------------------------------------------------------------------
12265 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
12266 - wavefront - generic
12267 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
12270 - If CU wavefront execution
12273 load atomic monotonic - singlethread - local 1. ds_load
12276 load atomic monotonic - agent - global 1. buffer/global/flat_load
12277 - system - generic glc=1 dlc=1
12279 - If GFX11, omit dlc=1.
12281 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
12282 - wavefront - generic
12286 store atomic monotonic - singlethread - local 1. ds_store
12289 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
12290 - wavefront - generic
12294 atomicrmw monotonic - singlethread - local 1. ds_atomic
**Acquire Atomic**
------------------------------------------------------------------------------------
12299 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
12300 - wavefront - local
12302 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
12304 - If CU wavefront execution
12307 2. s_waitcnt vmcnt(0)
12309 - If CU wavefront execution
12311 - Must happen before
12312 the following buffer_gl0_inv
12313 and before any following
12321 - If CU wavefront execution
12328 load atomic acquire - workgroup - local 1. ds_load
12329 2. s_waitcnt lgkmcnt(0)
12332 - Must happen before
12333 the following buffer_gl0_inv
12334 and before any following
12335 global/generic load/load
12341 older than the local load
12347 - If CU wavefront execution
12355 load atomic acquire - workgroup - generic 1. flat_load glc=1
12357 - If CU wavefront execution
12360 2. s_waitcnt lgkmcnt(0) &
12363 - If CU wavefront execution
12364 mode, omit vmcnt(0).
12367 - Must happen before
12369 buffer_gl0_inv and any
12370 following global/generic
12377 older than a local load
12383 - If CU wavefront execution
12390 load atomic acquire - agent - global 1. buffer/global_load
12391 - system glc=1 dlc=1
12393 - If GFX11, omit dlc=1.
12395 2. s_waitcnt vmcnt(0)
12397 - Must happen before
12402 before invalidating
12408 - Must happen before
12418 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
12420 - If GFX11, omit dlc=1.
12422 2. s_waitcnt vmcnt(0) &
12427 - Must happen before
12430 - Ensures the flat_load
12432 before invalidating
12438 - Must happen before
12448 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
12449 - wavefront - local
12451 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
12452 2. s_waitcnt vm/vscnt(0)
12454 - If CU wavefront execution
12456 - Use vmcnt(0) if atomic with
12457 return and vscnt(0) if
12458 atomic with no-return.
12459 - Must happen before
12460 the following buffer_gl0_inv
12461 and before any following
12469 - If CU wavefront execution
12476 atomicrmw acquire - workgroup - local 1. ds_atomic
12477 2. s_waitcnt lgkmcnt(0)
12480 - Must happen before
12486 older than the local
12498 atomicrmw acquire - workgroup - generic 1. flat_atomic
12499 2. s_waitcnt lgkmcnt(0) &
12502 - If CU wavefront execution
12503 mode, omit vm/vscnt(0).
12504 - If OpenCL, omit lgkmcnt(0).
12505 - Use vmcnt(0) if atomic with
12506 return and vscnt(0) if
12507 atomic with no-return.
12508 - Must happen before
12520 - If CU wavefront execution
12527 atomicrmw acquire - agent - global 1. buffer/global_atomic
12528 - system 2. s_waitcnt vm/vscnt(0)
12530 - Use vmcnt(0) if atomic with
12531 return and vscnt(0) if
12532 atomic with no-return.
12533 - Must happen before
12545 - Must happen before
12555 atomicrmw acquire - agent - generic 1. flat_atomic
12556 - system 2. s_waitcnt vm/vscnt(0) &
12561 - Use vmcnt(0) if atomic with
12562 return and vscnt(0) if
12563 atomic with no-return.
12564 - Must happen before
12576 - Must happen before
12586 fence acquire - singlethread *none* *none*
12588 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12589 vmcnt(0) & vscnt(0)
12591 - If CU wavefront execution
12592 mode, omit vmcnt(0) and
12601 vmcnt(0) and vscnt(0).
12602 - However, since LLVM
12607 always generate. If
12617 - Could be split into
12619 vmcnt(0), s_waitcnt
12620 vscnt(0) and s_waitcnt
12621 lgkmcnt(0) to allow
12623 independently moved
12626 - s_waitcnt vmcnt(0)
12629 global/generic load
12631 atomicrmw-with-return-value
12634 and memory ordering
12638 fence-paired-atomic).
12639 - s_waitcnt vscnt(0)
12643 atomicrmw-no-return-value
12646 and memory ordering
12650 fence-paired-atomic).
12651 - s_waitcnt lgkmcnt(0)
12658 and memory ordering
12662 fence-paired-atomic).
12663 - Must happen before
12667 fence-paired atomic
12669 before invalidating
12673 locations read must
12677 fence-paired-atomic.
12681 - If CU wavefront execution
12688 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
12689 - system vmcnt(0) & vscnt(0)
12698 vmcnt(0) and vscnt(0).
12699 - However, since LLVM
12707 - Could be split into
12709 vmcnt(0), s_waitcnt
12710 vscnt(0) and s_waitcnt
12711 lgkmcnt(0) to allow
12713 independently moved
12716 - s_waitcnt vmcnt(0)
12719 global/generic load
12721 atomicrmw-with-return-value
12724 and memory ordering
12728 fence-paired-atomic).
12729 - s_waitcnt vscnt(0)
12733 atomicrmw-no-return-value
12736 and memory ordering
12740 fence-paired-atomic).
12741 - s_waitcnt lgkmcnt(0)
12748 and memory ordering
12752 fence-paired-atomic).
12753 - Must happen before
12757 fence-paired atomic
12759 before invalidating
12763 locations read must
12767 fence-paired-atomic.
12772 - Must happen before any
12773 following global/generic
**Release Atomic**
------------------------------------------------------------------------------------
12784 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
12785 - wavefront - local
12787 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12788 - generic vmcnt(0) & vscnt(0)
12790 - If CU wavefront execution
12791 mode, omit vmcnt(0) and
12795 - Could be split into
12797 vmcnt(0), s_waitcnt
12798 vscnt(0) and s_waitcnt
12799 lgkmcnt(0) to allow
12801 independently moved
12804 - s_waitcnt vmcnt(0)
12807 global/generic load/load
12809 atomicrmw-with-return-value.
12810 - s_waitcnt vscnt(0)
12816 atomicrmw-no-return-value.
12817 - s_waitcnt lgkmcnt(0)
12824 - Must happen before
12832 store that is being
12835 2. buffer/global/flat_store
12836 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12838 - If CU wavefront execution
12841 - Could be split into
12843 vmcnt(0) and s_waitcnt
12846 independently moved
12849 - s_waitcnt vmcnt(0)
12852 global/generic load/load
12854 atomicrmw-with-return-value.
12855 - s_waitcnt vscnt(0)
12859 store/store atomic/
12860 atomicrmw-no-return-value.
12861 - Must happen before
12869 store that is being
12873 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
12874 - system - generic vmcnt(0) & vscnt(0)
12880 - Could be split into
12882 vmcnt(0), s_waitcnt vscnt(0)
12884 lgkmcnt(0) to allow
12886 independently moved
12889 - s_waitcnt vmcnt(0)
12895 atomicrmw-with-return-value.
12896 - s_waitcnt vscnt(0)
12900 store/store atomic/
12901 atomicrmw-no-return-value.
12902 - s_waitcnt lgkmcnt(0)
12909 - Must happen before
12917 store that is being
12920 2. buffer/global/flat_store
12921 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
12922 - wavefront - local
12924 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12925 - generic vmcnt(0) & vscnt(0)
12927 - If CU wavefront execution
12928 mode, omit vmcnt(0) and
12930 - If OpenCL, omit lgkmcnt(0).
12931 - Could be split into
12933 vmcnt(0), s_waitcnt
12934 vscnt(0) and s_waitcnt
12935 lgkmcnt(0) to allow
12937 independently moved
12940 - s_waitcnt vmcnt(0)
12943 global/generic load/load
12945 atomicrmw-with-return-value.
12946 - s_waitcnt vscnt(0)
12952 atomicrmw-no-return-value.
12953 - s_waitcnt lgkmcnt(0)
12960 - Must happen before
12971 2. buffer/global/flat_atomic
12972 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12974 - If CU wavefront execution
12977 - Could be split into
12979 vmcnt(0) and s_waitcnt
12982 independently moved
12985 - s_waitcnt vmcnt(0)
12988 global/generic load/load
12990 atomicrmw-with-return-value.
12991 - s_waitcnt vscnt(0)
12995 store/store atomic/
12996 atomicrmw-no-return-value.
12997 - Must happen before
13005 store that is being
13009 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
13010 - system - generic vmcnt(0) & vscnt(0)
13014 - Could be split into
13016 vmcnt(0), s_waitcnt
13017 vscnt(0) and s_waitcnt
13018 lgkmcnt(0) to allow
13020 independently moved
13023 - s_waitcnt vmcnt(0)
13028 atomicrmw-with-return-value.
13029 - s_waitcnt vscnt(0)
13033 store/store atomic/
13034 atomicrmw-no-return-value.
13035 - s_waitcnt lgkmcnt(0)
13042 - Must happen before
13047 to global and local
13053 2. buffer/global/flat_atomic
13054 fence release - singlethread *none* *none*
13056 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
13057 vmcnt(0) & vscnt(0)
13059 - If CU wavefront execution
13060 mode, omit vmcnt(0) and
13069 vmcnt(0) and vscnt(0).
13070 - However, since LLVM
13075 always generate. If
13085 - Could be split into
13087 vmcnt(0), s_waitcnt
13088 vscnt(0) and s_waitcnt
13089 lgkmcnt(0) to allow
13091 independently moved
13094 - s_waitcnt vmcnt(0)
13100 atomicrmw-with-return-value.
13101 - s_waitcnt vscnt(0)
13105 store/store atomic/
13106 atomicrmw-no-return-value.
13107 - s_waitcnt lgkmcnt(0)
13112 atomic/store atomic/
13114 - Must happen before
13115 any following store
13119 and memory ordering
13123 fence-paired-atomic).
13130 fence-paired-atomic.
13132 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
13133 - system vmcnt(0) & vscnt(0)
13142 vmcnt(0) and vscnt(0).
13143 - However, since LLVM
13148 always generate. If
13158 - Could be split into
13160 vmcnt(0), s_waitcnt
13161 vscnt(0) and s_waitcnt
13162 lgkmcnt(0) to allow
13164 independently moved
13167 - s_waitcnt vmcnt(0)
13172 atomicrmw-with-return-value.
13173 - s_waitcnt vscnt(0)
13177 store/store atomic/
13178 atomicrmw-no-return-value.
13179 - s_waitcnt lgkmcnt(0)
13186 - Must happen before
13187 any following store
13191 and memory ordering
13195 fence-paired-atomic).
13202 fence-paired-atomic.
13204 **Acquire-Release Atomic**
13205 ------------------------------------------------------------------------------------
13206 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
13207 - wavefront - local
13209 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
13210 vmcnt(0) & vscnt(0)
13212 - If CU wavefront execution
13213 mode, omit vmcnt(0) and
13217 - Must happen after
13223 - Could be split into
13225 vmcnt(0), s_waitcnt
13226 vscnt(0), and s_waitcnt
13227 lgkmcnt(0) to allow
13229 independently moved
13232 - s_waitcnt vmcnt(0)
13235 global/generic load/load
13237 atomicrmw-with-return-value.
13238 - s_waitcnt vscnt(0)
13244 atomicrmw-no-return-value.
13245 - s_waitcnt lgkmcnt(0)
13252 - Must happen before
13263 2. buffer/global_atomic
13264 3. s_waitcnt vm/vscnt(0)
13266 - If CU wavefront execution
13268 - Use vmcnt(0) if atomic with
13269 return and vscnt(0) if
13270 atomic with no-return.
13271 - Must happen before
13283 - If CU wavefront execution
13290 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
13292 - If CU wavefront execution
13295 - Could be split into
13297 vmcnt(0) and s_waitcnt
13300 independently moved
13303 - s_waitcnt vmcnt(0)
13306 global/generic load/load
13308 atomicrmw-with-return-value.
13309 - s_waitcnt vscnt(0)
13313 store/store atomic/
13314 atomicrmw-no-return-value.
13315 - Must happen before
13323 store that is being
13327 3. s_waitcnt lgkmcnt(0)
13330 - Must happen before
13336 older than the local load
13342 - If CU wavefront execution
13350 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
13351 vmcnt(0) & vscnt(0)
13353 - If CU wavefront execution
13354 mode, omit vmcnt(0) and
13356 - If OpenCL, omit lgkmcnt(0).
13357 - Could be split into
13359 vmcnt(0), s_waitcnt
13360 vscnt(0) and s_waitcnt
13361 lgkmcnt(0) to allow
13363 independently moved
13366 - s_waitcnt vmcnt(0)
13369 global/generic load/load
13371 atomicrmw-with-return-value.
13372 - s_waitcnt vscnt(0)
13378 atomicrmw-no-return-value.
13379 - s_waitcnt lgkmcnt(0)
13386 - Must happen before
13398 3. s_waitcnt lgkmcnt(0) &
13399 vmcnt(0) & vscnt(0)
13401 - If CU wavefront execution
13402 mode, omit vmcnt(0) and
13404 - If OpenCL, omit lgkmcnt(0).
13405 - Must happen before
13411 older than the load
13417 - If CU wavefront execution
13424 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
13425 - system vmcnt(0) & vscnt(0)
13429 - Could be split into
13431 vmcnt(0), s_waitcnt
13432 vscnt(0) and s_waitcnt
13433 lgkmcnt(0) to allow
13435 independently moved
13438 - s_waitcnt vmcnt(0)
13443 atomicrmw-with-return-value.
13444 - s_waitcnt vscnt(0)
13448 store/store atomic/
13449 atomicrmw-no-return-value.
13450 - s_waitcnt lgkmcnt(0)
13457 - Must happen before
13468 2. buffer/global_atomic
13469 3. s_waitcnt vm/vscnt(0)
13471 - Use vmcnt(0) if atomic with
13472 return and vscnt(0) if
13473 atomic with no-return.
13474 - Must happen before
13486 - Must happen before
13496 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
13497 - system vmcnt(0) & vscnt(0)
13501 - Could be split into
13503 vmcnt(0), s_waitcnt
13504 vscnt(0), and s_waitcnt
13505 lgkmcnt(0) to allow
13507 independently moved
13510 - s_waitcnt vmcnt(0)
13515 atomicrmw-with-return-value.
13516 - s_waitcnt vscnt(0)
13520 store/store atomic/
13521 atomicrmw-no-return-value.
13522 - s_waitcnt lgkmcnt(0)
13529 - Must happen before
13541 3. s_waitcnt vm/vscnt(0) &
13546 - Use vmcnt(0) if atomic with
13547 return and vscnt(0) if
13548 atomic with no-return.
13549 - Must happen before
13561 - Must happen before
13571 fence acq_rel - singlethread *none* *none*
13573 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
13574 vmcnt(0) & vscnt(0)
13576 - If CU wavefront execution
13577 mode, omit vmcnt(0) and
13586 vmcnt(0) and vscnt(0).
13596 - Could be split into
13598 vmcnt(0), s_waitcnt
13599 vscnt(0) and s_waitcnt
13600 lgkmcnt(0) to allow
13602 independently moved
13605 - s_waitcnt vmcnt(0)
13611 atomicrmw-with-return-value.
13612 - s_waitcnt vscnt(0)
13616 store/store atomic/
13617 atomicrmw-no-return-value.
13618 - s_waitcnt lgkmcnt(0)
13623 atomic/store atomic/
13625 - Must happen before
13644 and memory ordering
13648 acquire-fence-paired-atomic)
13661 local/generic store
13665 and memory ordering
13669 release-fence-paired-atomic).
13673 - Must happen before
13677 acquire-fence-paired
13678 atomic has completed
13679 before invalidating
13683 locations read must
13687 acquire-fence-paired-atomic.
13691 - If CU wavefront execution
13698 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
13699 - system vmcnt(0) & vscnt(0)
13708 vmcnt(0) and vscnt(0).
13709 - However, since LLVM
13717 - Could be split into
13719 vmcnt(0), s_waitcnt
13720 vscnt(0) and s_waitcnt
13721 lgkmcnt(0) to allow
13723 independently moved
13726 - s_waitcnt vmcnt(0)
13732 atomicrmw-with-return-value.
13733 - s_waitcnt vscnt(0)
13737 store/store atomic/
13738 atomicrmw-no-return-value.
13739 - s_waitcnt lgkmcnt(0)
13746 - Must happen before
13751 global/local/generic
13756 and memory ordering
13760 acquire-fence-paired-atomic)
13762 before invalidating
13772 global/local/generic
13777 and memory ordering
13781 release-fence-paired-atomic).
13789 - Must happen before
13803 **Sequential Consistent Atomic**
13804 ------------------------------------------------------------------------------------
13805 load atomic seq_cst - singlethread - global *Same as corresponding
13806 - wavefront - local load atomic acquire,
13807 - generic except must generate
13808 all instructions even
13810 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
13811 - generic vmcnt(0) & vscnt(0)
13813 - If CU wavefront execution
13814 mode, omit vmcnt(0) and
13816 - Could be split into
13818 vmcnt(0), s_waitcnt
13819 vscnt(0), and s_waitcnt
13820 lgkmcnt(0) to allow
13822 independently moved
13825 - s_waitcnt lgkmcnt(0) must
13832 ordering of seq_cst
13838 lgkmcnt(0) and so do
13841 - s_waitcnt vmcnt(0)
13844 global/generic load
13846 atomicrmw-with-return-value
13848 ordering of seq_cst
13857 - s_waitcnt vscnt(0)
13860 global/generic store
13862 atomicrmw-no-return-value
13864 ordering of seq_cst
13876 consistent global/local
13877 memory instructions
13883 prevents reordering
13886 seq_cst load. (Note
13892 followed by a store
13899 release followed by
13902 order. The s_waitcnt
13903 could be placed after
13904 seq_store or before
13907 make the s_waitcnt be
13908 as late as possible
13914 instructions same as
13917 except must generate
13918 all instructions even
13920 load atomic seq_cst - workgroup - local
13922 1. s_waitcnt vmcnt(0) & vscnt(0)
13924 - If CU wavefront execution
13926 - Could be split into
13928 vmcnt(0) and s_waitcnt
13931 independently moved
13934 - s_waitcnt vmcnt(0)
13937 global/generic load
13939 atomicrmw-with-return-value
13941 ordering of seq_cst
13950 - s_waitcnt vscnt(0)
13953 global/generic store
13955 atomicrmw-no-return-value
13957 ordering of seq_cst
13970 memory instructions
13976 prevents reordering
13979 seq_cst load. (Note
13985 followed by a store
13992 release followed by
13995 order. The s_waitcnt
13996 could be placed after
13997 seq_store or before
14000 make the s_waitcnt be
14001 as late as possible
14007 instructions same as
14010 except must generate
14011 all instructions even
14014 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
14015 - system - generic vmcnt(0) & vscnt(0)
14017 - Could be split into
14019 vmcnt(0), s_waitcnt
14020 vscnt(0) and s_waitcnt
14021 lgkmcnt(0) to allow
14023 independently moved
14026 - s_waitcnt lgkmcnt(0)
14033 ordering of seq_cst
14039 lgkmcnt(0) and so do
14042 - s_waitcnt vmcnt(0)
14045 global/generic load
14047 atomicrmw-with-return-value
14049 ordering of seq_cst
14058 - s_waitcnt vscnt(0)
14061 global/generic store
14063 atomicrmw-no-return-value
14065 ordering of seq_cst
14078 memory instructions
14084 prevents reordering
14087 seq_cst load. (Note
14093 followed by a store
14100 release followed by
14103 order. The s_waitcnt
14104 could be placed after
14105 seq_store or before
14108 make the s_waitcnt be
14109 as late as possible
14115 instructions same as
14118 except must generate
14119 all instructions even
14121 store atomic seq_cst - singlethread - global *Same as corresponding
14122 - wavefront - local store atomic release,
14123 - workgroup - generic except must generate
14124 - agent all instructions even
14125 - system for OpenCL.*
14126 atomicrmw seq_cst - singlethread - global *Same as corresponding
14127 - wavefront - local atomicrmw acq_rel,
14128 - workgroup - generic except must generate
14129 - agent all instructions even
14130 - system for OpenCL.*
14131 fence seq_cst - singlethread *none* *Same as corresponding
14132 - wavefront fence acq_rel,
14133 - workgroup except must generate
14134 - agent all instructions even
14135 - system for OpenCL.*
14136 ============ ============ ============== ========== ================================
.. _amdgpu-amdhsa-trap-handler-abi:

Trap Handler ABI
~~~~~~~~~~~~~~~~

14143 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
14144 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
14145 supports the ``s_trap`` instruction. For usage see:
14147 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
14148 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
14149 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
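
As a source-level illustration, the clang builtins ``__builtin_trap`` and
``__builtin_debugtrap`` lower to the ``llvm.trap`` and ``llvm.debugtrap``
intrinsics listed in these tables. The sketch below only restates that
mapping; the function names are hypothetical.

.. code-block:: c

   /* Illustrative only: source constructs that reach the trap handler
    * entries described in the tables below. */
   void fatal_error(void) {
     __builtin_trap();      /* llvm.trap      -> s_trap 0x02 */
   }

   void debug_break_hint(void) {
     __builtin_debugtrap(); /* llvm.debugtrap -> s_trap 0x03 */
   }
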
14151 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
14152 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
14154 =================== =============== =============== =======================================
14155 Usage Code Sequence Trap Handler Description
14157 =================== =============== =============== =======================================
14158 reserved ``s_trap 0x00`` Reserved by hardware.
14159 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap``
14160 ``queue_ptr`` intrinsic (not implemented).
14163 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
14164 ``queue_ptr`` the trap instruction. The associated
14165 queue is signalled to put it into the
14166 error state. When the queue is put in
14167 the error state, the waves executing
dispatches on the queue will be terminated.
14170 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
14171 as a no-operation. The trap handler
14172 is entered and immediately returns to
14173 continue execution of the wavefront.
14174 - If the debugger is enabled, causes
14175 the debug trap to be reported by the
14176 debugger and the wavefront is put in
14177 the halt state with the PC at the
14178 instruction. The debugger must
14179 increment the PC and resume the wave.
14180 reserved ``s_trap 0x04`` Reserved.
14181 reserved ``s_trap 0x05`` Reserved.
14182 reserved ``s_trap 0x06`` Reserved.
14183 reserved ``s_trap 0x07`` Reserved.
14184 reserved ``s_trap 0x08`` Reserved.
14185 reserved ``s_trap 0xfe`` Reserved.
14186 reserved ``s_trap 0xff`` Reserved.
14187 =================== =============== =============== =======================================
14191 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
14192 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
14194 =================== =============== =============== =======================================
14195 Usage Code Sequence Trap Handler Description
14197 =================== =============== =============== =======================================
14198 reserved ``s_trap 0x00`` Reserved by hardware.
14199 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for
14200 breakpoints. Causes wave to be halted
14201 with the PC at the trap instruction.
14202 The debugger is responsible to resume
14203 the wave, including the instruction
14204 that the breakpoint overwrote.
14205 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
14206 ``queue_ptr`` the trap instruction. The associated
14207 queue is signalled to put it into the
14208 error state. When the queue is put in
14209 the error state, the waves executing
dispatches on the queue will be terminated.
14212 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
14213 as a no-operation. The trap handler
14214 is entered and immediately returns to
14215 continue execution of the wavefront.
14216 - If the debugger is enabled, causes
14217 the debug trap to be reported by the
14218 debugger and the wavefront is put in
14219 the halt state with the PC at the
14220 instruction. The debugger must
14221 increment the PC and resume the wave.
14222 reserved ``s_trap 0x04`` Reserved.
14223 reserved ``s_trap 0x05`` Reserved.
14224 reserved ``s_trap 0x06`` Reserved.
14225 reserved ``s_trap 0x07`` Reserved.
14226 reserved ``s_trap 0x08`` Reserved.
14227 reserved ``s_trap 0xfe`` Reserved.
14228 reserved ``s_trap 0xff`` Reserved.
14229 =================== =============== =============== =======================================
14233 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
14234 :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
14236 =================== =============== ================ ================= =======================================
14237 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
14238 =================== =============== ================ ================= =======================================
14239 reserved ``s_trap 0x00`` Reserved by hardware.
14240 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for
14241 breakpoints. Causes wave to be halted
14242 with the PC at the trap instruction.
14243 The debugger is responsible to resume
14244 the wave, including the instruction
14245 that the breakpoint overwrote.
14246 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at
14247 ``queue_ptr`` the trap instruction. The associated
14248 queue is signalled to put it into the
14249 error state. When the queue is put in
14250 the error state, the waves executing
dispatches on the queue will be terminated.
14253 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If debugger not enabled then behaves
14254 as a no-operation. The trap handler
14255 is entered and immediately returns to
14256 continue execution of the wavefront.
14257 - If the debugger is enabled, causes
14258 the debug trap to be reported by the
14259 debugger and the wavefront is put in
14260 the halt state with the PC at the
14261 instruction. The debugger must
14262 increment the PC and resume the wave.
14263 reserved ``s_trap 0x04`` Reserved.
14264 reserved ``s_trap 0x05`` Reserved.
14265 reserved ``s_trap 0x06`` Reserved.
14266 reserved ``s_trap 0x07`` Reserved.
14267 reserved ``s_trap 0x08`` Reserved.
14268 reserved ``s_trap 0xfe`` Reserved.
14269 reserved ``s_trap 0xff`` Reserved.
14270 =================== =============== ================ ================= =======================================
.. _amdgpu-amdhsa-function-call-convention:

Call Convention
~~~~~~~~~~~~~~~

This section is currently incomplete and has inaccuracies. It is a work in
progress and will be updated as information is determined.
14282 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
14283 addresses. Unswizzled addresses are normal linear addresses.
.. _amdgpu-amdhsa-function-call-convention-kernel-functions:

Kernel Functions
++++++++++++++++

14290 This section describes the call convention ABI for the outer kernel function.
See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.
14295 The following is not part of the AMDGPU kernel calling convention but describes
14296 how the AMDGPU implements function calls:
1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.
14301 - All structs are passed directly.
14302 - Lambda values are passed *TBA*.
14306 - Does this really follow HSA rules? Or are structs >16 bytes passed
14308 - What is ABI for lambda values?
14310 4. The kernel performs certain setup in its prolog, as described in
14311 :ref:`amdgpu-amdhsa-kernel-prolog`.
14313 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
14315 Non-Kernel Functions
14316 ++++++++++++++++++++
14318 This section describes the call convention ABI for functions other than the
14319 outer kernel function.
14321 If a kernel has function calls then scratch is always allocated and used for
14322 the call stack which grows from low address to high address using the swizzled
14323 scratch address space.
14325 On entry to a function:
14327 1. SGPR0-3 contain a V# with the following properties (see
14328 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
* Base address pointing to the beginning of the wavefront scratch backing
  memory.
14332 * Swizzled with dword element size and stride of wavefront size elements.
14334 2. The FLAT_SCRATCH register pair is setup. See
14335 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
14336 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
14337 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
14338 4. The EXEC register is set to the lanes active on entry to the function.
14339 5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
14342 7. SGPR30-31 return address (RA). The code address that the function must
return to when it completes. The value is undefined if the function is *no
return*.
14345 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
14346 offset relative to the beginning of the wavefront scratch backing memory.
14348 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
manner.
14352 The unswizzled SP value can be converted into the swizzled SP value by:
14354 | swizzled SP = unswizzled SP / wavefront size
14356 This may be used to obtain the private address space address of stack
14357 objects and to convert this address to a flat address by adding the flat
scratch aperture base address (see the sketch after this list).
The swizzled SP value is always 4-byte aligned for the ``r600`` architecture
and 16-byte aligned for the ``amdgcn`` architecture.
14365 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
14366 OpenCL language which has the largest base type defined as 16 bytes.
14368 On entry, the swizzled SP value is the address of the first function
14369 argument passed on the stack. Other stack passed arguments are positive
14370 offsets from the entry swizzled SP value.
14372 The function may use positive offsets beyond the last stack passed argument
14373 for stack allocated local variables and register spill slots. If necessary,
14374 the function may align these to greater alignment than 16 bytes. After these
14375 the function may dynamically allocate space for such things as runtime sized
14376 ``alloca`` local allocations.
14378 If the function calls another function, it will place any stack allocated
14379 arguments after the last local allocation and adjust SGPR32 to the address
14380 after the last local allocation.
14382 9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.
11. Use pass-by-reference (byref) instead of pass-by-value (byval) for struct
    arguments in the C ABI. The callee is responsible for allocating stack
    memory and copying the value of the struct if modified. Note that the
    backend still supports byval for struct arguments.
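
The stack pointer conversion described in item 8 above can be sketched as
follows. This is a minimal illustration of the arithmetic only; the helper
names are hypothetical and the wavefront size and flat scratch aperture base
address are assumed to be known by the caller.

.. code-block:: c

   #include <stdint.h>

   /* swizzled SP = unswizzled SP / wavefront size */
   static inline uint32_t swizzled_sp(uint32_t unswizzled_sp,
                                      uint32_t wavefront_size) {
     return unswizzled_sp / wavefront_size;
   }

   /* A private (scratch) address can be converted to a flat address by
    * adding the flat scratch aperture base address. */
   static inline uint64_t flat_stack_address(uint32_t unswizzled_sp,
                                             uint32_t wavefront_size,
                                             uint64_t flat_scratch_aperture_base) {
     return flat_scratch_aperture_base +
            (uint64_t)swizzled_sp(unswizzled_sp, wavefront_size);
   }
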
14390 On exit from a function:
14392 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
14393 described below. Any registers used are considered clobbered registers.
14394 2. The following registers are preserved and have the same value as on entry:
14399 * All SGPR registers except the clobbered registers of SGPR4-31.
14417 Except the argument registers, the VGPRs clobbered and the preserved
14418 registers are intermixed at regular intervals in order to keep a
14419 similar ratio independent of the number of allocated VGPRs.
14421 * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
14422 * Lanes of all VGPRs that are inactive at the call site.
14424 For the AMDGPU backend, an inter-procedural register allocation (IPRA)
optimization may mark some of the clobbered SGPR and VGPR registers as
preserved if it can be determined that the called function does not change
their value.
3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.
14437 - How are function results returned? The address of structured types is passed
14438 by reference, but what about other types?
14440 The function input arguments are made up of the formal arguments explicitly
14441 declared by the source language function plus the implicit input arguments used
14442 by the implementation.
14444 The source language input arguments are:
1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
14448 2. Followed by the function formal arguments in left to right source order.
14450 The source language result arguments are:
14452 1. The function result argument.
14454 The source language input or result struct type arguments that are less than or
14455 equal to 16 bytes, are decomposed recursively into their base type fields, and
14456 each field is passed as if a separate argument. For input arguments, if the
14457 called function requires the struct to be in memory, for example because its
14458 address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
14462 The source language input struct type arguments that are greater than 16 bytes,
14463 are passed by reference. The caller is responsible for allocating a stack
14464 location to make a copy of the struct value and pass the address as the input
14465 argument. The called function is responsible to perform the dereference when
14466 accessing the input argument. Clang terms this *by-value struct*.
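
As a hedged illustration of the two cases above, the hypothetical argument
types below show a struct that is small enough to be decomposed and one that
is passed by reference to a caller-made copy; the comments merely restate the
rules from the preceding paragraphs.

.. code-block:: c

   #include <stdint.h>

   struct small_arg {   /* 12 bytes (<= 16): decomposed into its base type
                           fields, each passed as if a separate argument
                           ("direct struct"). */
     float x, y;
     int32_t id;
   };

   struct large_arg {   /* 32 bytes (> 16): the caller allocates a stack copy
                           and passes its address ("by-value struct"). */
     double values[4];
   };

   /* Hypothetical functions taking the two kinds of arguments. */
   float consume_small(struct small_arg a);
   double consume_large(struct large_arg b);
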
14468 A source language result struct type argument that is greater than 16 bytes, is
14469 returned by reference. The caller is responsible for allocating a stack location
14470 to hold the result value and passes the address as the last input argument
14471 (before the implicit input arguments). In this case there are no result
14472 arguments. The called function is responsible to perform the dereference when
14473 storing the result value. Clang terms this *structured return (sret)*.
14475 *TODO: correct the ``sret`` definition.*
14479 Is this definition correct? Or is ``sret`` only used if passing in registers, and
14480 pass as non-decomposed struct as stack argument? Or something else? Is the
14481 memory location in the caller stack frame, or a stack memory argument and so
14482 no address is passed as the caller can directly write to the argument stack
location? But then the stack location is still live after return. If it is an
argument stack location, is it the first stack argument or the last one?
Lambda argument types are treated as struct types with an implementation defined
set of fields.
14491 Need to specify the ABI for lambda types for AMDGPU.
For the AMDGPU backend, all source language arguments (including the decomposed
14494 struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
14495 they are passed in SGPRs.
14497 The AMDGPU backend walks the function call graph from the leaves to determine
14498 which implicit input arguments are used, propagating to each caller of the
14499 function. The used implicit arguments are appended to the function arguments
14500 after the source language arguments in the following order:
Are recursion and external functions supported?
14506 1. Work-Item ID (1 VGPR)
The X, Y and Z work-item ID are packed into a single VGPR with the following
layout. Only fields actually used by the function are set; the other bits are
undefined (see the unpacking sketch after this list).
14512 The values come from the initial kernel execution state. See
14513 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14515 .. table:: Work-item implicit argument layout
14516 :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
14518 ======= ======= ==============
14519 Bits Size Field Name
14520 ======= ======= ==============
14521 9:0 10 bits X Work-Item ID
14522 19:10 10 bits Y Work-Item ID
14523 29:20 10 bits Z Work-Item ID
14524 31:30 2 bits Unused
14525 ======= ======= ==============
14527 2. Dispatch Ptr (2 SGPRs)
14529 The value comes from the initial kernel execution state. See
14530 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14532 3. Queue Ptr (2 SGPRs)
14534 The value comes from the initial kernel execution state. See
14535 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14537 4. Kernarg Segment Ptr (2 SGPRs)
14539 The value comes from the initial kernel execution state. See
14540 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14542 5. Dispatch id (2 SGPRs)
14544 The value comes from the initial kernel execution state. See
14545 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14547 6. Work-Group ID X (1 SGPR)
14549 The value comes from the initial kernel execution state. See
14550 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14552 7. Work-Group ID Y (1 SGPR)
14554 The value comes from the initial kernel execution state. See
14555 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14557 8. Work-Group ID Z (1 SGPR)
14559 The value comes from the initial kernel execution state. See
14560 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14562 9. Implicit Argument Ptr (2 SGPRs)
14564 The value is computed by adding an offset to Kernarg Segment Ptr to get the
14565 global address space pointer to the first kernarg implicit argument.
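
The packed work-item ID from item 1 above can be unpacked as sketched below;
the helper and type names are hypothetical and only restate the bit layout
given in the table.

.. code-block:: c

   #include <stdint.h>

   struct workitem_id {
     uint32_t x, y, z;
   };

   /* Extract the X, Y and Z work-item IDs from the packed 32-bit value. */
   static inline struct workitem_id unpack_workitem_id(uint32_t packed) {
     struct workitem_id id;
     id.x = packed & 0x3ffu;         /* bits 9:0   */
     id.y = (packed >> 10) & 0x3ffu; /* bits 19:10 */
     id.z = (packed >> 20) & 0x3ffu; /* bits 29:20 */
     return id;
   }
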
14567 The input and result arguments are assigned in order in the following manner:
There are likely some errors and omissions in the following description that
need correction.
14576 Check the Clang source code to decipher how function arguments and return
14577 results are handled. Also see the AMDGPU specific values used.
* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.
14582 If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
14588 How are overly aligned structures allocated on the stack?
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.
14593 If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
14597 Note that decomposed struct type arguments may have some fields passed in
14598 registers and some in memory.
So, a struct which can pass some fields as decomposed register arguments will
pass the rest as decomposed stack elements? But an argument that will not start
in registers will not be decomposed and will be passed as a non-decomposed
struct?
14607 The following is not part of the AMDGPU function calling convention but
14608 describes how the AMDGPU implements function calls:
14610 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
14611 unswizzled scratch address. It is only needed if runtime sized ``alloca``
14612 are used, or for the reasons defined in ``SIFrameLowering``.
14613 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
14614 to access the incoming stack arguments in the function. The BP is needed
14615 only when the function requires the runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
14619 4. No CFI is currently generated. See
14620 :ref:`amdgpu-dwarf-call-frame-information`.
14624 CFI will be generated that defines the CFA as the unswizzled address
14625 relative to the wave scratch base in the unswizzled private address space
14626 of the lowest address stack allocated local variable.
14628 ``DW_AT_frame_base`` will be defined as the swizzled address in the
14629 swizzled private address space by dividing the CFA by the wavefront size
14630 (since CFA is always at least dword aligned which matches the scratch
14631 swizzle element size).
14633 If no dynamic stack alignment was performed, the stack allocated arguments
14634 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
14635 local variables and register spill slots are accessed as positive offsets
14636 relative to ``DW_AT_frame_base``.
14638 5. Function argument passing is implemented by copying the input physical
14639 registers to virtual registers on entry. The register allocator can spill if
14640 necessary. These are copied back to physical registers at call sites. The
14641 net effect is that each function call can have these values in entirely
14642 distinct locations. The IPRA can help avoid shuffling argument registers.
14643 6. Call sites are implemented by setting up the arguments at positive offsets
14644 from SP. Then SP is incremented to account for the known frame size before
14645 the call and decremented after the call.
14649 The CFI will reflect the changed calculation needed to compute the CFA
14652 7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
14653 emergency spill slot. Buffer instructions are used for stack accesses and
14654 not the ``flat_scratch`` instruction.
14658 Explain when the emergency spill slot is used.
14662 Possible broken issues:
14664 - Stack arguments must be aligned to required alignment.
14665 - Stack is aligned to max(16, max formal argument alignment)
14666 - Direct argument < 64 bits should check register budget.
14667 - Register budget calculation should respect ``inreg`` for SGPR.
14668 - SGPR overflow is not handled.
14669 - struct with 1 member unpeeling is not checking size of member.
14670 - ``sret`` is after ``this`` pointer.
14671 - Caller is not implementing stack realignment: need an extra pointer.
14672 - Should say AMDGPU passes FP rather than SP.
- Should CFI define the CFA as the address of locals or arguments? The
  difference becomes apparent once dynamic alignment is implemented.
- If the ``SCRATCH`` instruction could allow negative offsets, then the FP
  could be made the highest address of the stack frame and negative offsets
  used for locals. This would allow the SP to be the same as the FP and could
  support signal-handler-like usage, as there would then be a real SP for the
  top of the stack.
- How is ``sret`` passed on the stack? In argument stack area? Can it overlay
  arguments?
14685 This section provides code conventions used when the target triple OS is
14686 ``amdpal`` (see :ref:`amdgpu-target-triples`).
14688 .. _amdgpu-amdpal-code-object-metadata-section:
14690 Code Object Metadata
14691 ~~~~~~~~~~~~~~~~~~~~
14695 The metadata is currently in development and is subject to major
14696 changes. Only the current version is supported. *When this document
14697 was generated the version was 2.6.*
14699 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
14700 record (see :ref:`amdgpu-note-records-v3-onwards`).
14702 The metadata is represented as Message Pack formatted binary data (see
14703 [MsgPack]_). The top level is a Message Pack map that includes the keys
14704 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
14705 and referenced tables.
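
As an informal illustration (not part of the metadata definition), the sketch
below shows how such a blob could be inspected once the descriptor bytes of the
``NT_AMDGPU_METADATA`` note have been extracted from the ELF. It assumes the
third-party ``msgpack`` Python package and a hypothetical ``note_blob`` byte
string; neither is required by this specification.

.. code-block:: python

  import msgpack  # third-party "msgpack" package (assumed for this sketch)

  def dump_amdpal_metadata(note_blob: bytes) -> None:
      # The top level is a Message Pack map.
      metadata = msgpack.unpackb(note_blob)
      major, minor = metadata["amdpal.version"]
      print(f"PAL code object metadata version {major}.{minor}")
      for pipeline in metadata["amdpal.pipelines"]:
          # Keys inside each pipeline map are described in the
          # "AMDPAL Code Object Pipeline Metadata Map" table below.
          print(pipeline.get(".name"), pipeline.get(".type"))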
14707 Additional information can be added to the maps. To avoid conflicts, any
14708 key names should be prefixed by "*vendor-name*." where ``vendor-name``
14709 can be the name of the vendor and specific vendor tool that generates the
14710 information. The prefix is abbreviated to simply "." when it appears
14711 within a map that has been added by the same *vendor-name*.
14713 .. table:: AMDPAL Code Object Metadata Map
14714 :name: amdgpu-amdpal-code-object-metadata-map-table
14716 =================== ============== ========= ======================================================================
14717 String Key Value Type Required? Description
14718 =================== ============== ========= ======================================================================
14719 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values
14720 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
14721 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See
14722 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
14723 definition of the keys included in that map.
14724 =================== ============== ========= ======================================================================
14728 .. table:: AMDPAL Code Object Pipeline Metadata Map
14729 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
14731 ====================================== ============== ========= ===================================================
14732 String Key Value Type Required? Description
14733 ====================================== ============== ========= ===================================================
14734 ".name" string Source name of the pipeline.
14735 ".type" string Pipeline type, e.g. VsPs. Values include:
14745 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
14746 2 integers 64 bits is the "stable" portion of the hash, used
14747 for e.g. shader replacement lookup. Upper 64 bits
14748 is the "unique" portion of the hash, used for
14749 e.g. pipeline cache lookup. The value is
14750 implementation defined, and can not be relied on
14751 between different builds of the compiler.
14752 ".shaders" map Per-API shader metadata. See
14753 :ref:`amdgpu-amdpal-code-object-shader-map-table`
for the definition of the keys included in that map.
14756 ".hardware_stages" map Per-hardware stage metadata. See
14757 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
for the definition of the keys included in that map.
14760 ".shader_functions" map Per-shader function metadata. See
14761 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
for the definition of the keys included in that map.
14764 ".registers" map Required Hardware register configuration. See
14765 :ref:`amdgpu-amdpal-code-object-register-map-table`
for the definition of the keys included in that map.
14768 ".user_data_limit" integer Number of user data entries accessed by this
14770 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
14771 NoUserDataSpilling.
14772 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
14773 viewport array index feature. Pipelines which use
14774 this feature can render into all 16 viewports,
14775 whereas pipelines which do not use it are
14776 restricted to viewport #0.
14777 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
14778 handling data-passing between the ES and GS
14779 shader stages. This can be zero if the data is
14780 passed using off-chip buffers. This value should
14781 be used to program all user-SGPRs which have been
14782 marked with "UserDataMapping::EsGsLdsSize"
14783 (typically only the GS and VS HW stages will ever
14784 have a user-SGPR so marked).
14785 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
14786 (maximum number of threads in a subgroup).
14787 ".num_interpolants" integer Graphics only. Number of PS interpolants.
14788 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
14789 ".api" string Name of the client graphics API.
14790 ".api_create_info" binary Graphics API shader create info binary blob. Can
14791 be defined by the driver using the compiler if
14792 they want to be able to correlate API-specific
14793 information used during creation at a later time.
14794 ====================================== ============== ========= ===================================================
14798 .. table:: AMDPAL Code Object Shader Map
14799 :name: amdgpu-amdpal-code-object-shader-map-table
14802 +-------------+--------------+-------------------------------------------------------------------+
14803 |String Key |Value Type |Description |
14804 +=============+==============+===================================================================+
14805 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
14806 |- ".vertex" | |for the definition of the keys included in that map. |
14809 |- ".geometry"| | |
14811 +-------------+--------------+-------------------------------------------------------------------+
14815 .. table:: AMDPAL Code Object API Shader Metadata Map
14816 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
14818 ==================== ============== ========= =====================================================================
14819 String Key Value Type Required? Description
14820 ==================== ============== ========= =====================================================================
14821 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
14822 2 integers is implementation defined, and can not be relied on between
14823 different builds of the compiler.
14824 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
14835 ==================== ============== ========= =====================================================================
14839 .. table:: AMDPAL Code Object Hardware Stage Map
14840 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14842 +-------------+--------------+-----------------------------------------------------------------------+
14843 |String Key |Value Type |Description |
14844 +=============+==============+=======================================================================+
14845 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14846 |- ".hs" | |for the definition of the keys included in that map. |
14852 +-------------+--------------+-----------------------------------------------------------------------+
14856 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14857 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14859 ========================== ============== ========= ===============================================================
14860 String Key Value Type Required? Description
14861 ========================== ============== ========= ===============================================================
14862 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
14863 ".scratch_memory_size" integer Scratch memory size in bytes.
14864 ".lds_size" integer Local Data Share size in bytes.
14865 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
14866 ".vgpr_count" integer Number of VGPRs used.
14867 ".agpr_count" integer Number of AGPRs used.
14868 ".sgpr_count" integer Number of SGPRs used.
14869 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
14870 directive to instruct the compiler to limit the VGPR usage to
14871 be less than or equal to the specified value (only set if
14872 different from HW default).
14873 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW
14875 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
14877 ".wavefront_size" integer Wavefront size (only set if different from HW default).
14878 ".uses_uavs" boolean The shader reads or writes UAVs.
14879 ".uses_rovs" boolean The shader reads or writes ROVs.
14880 ".writes_uavs" boolean The shader writes to one or more UAVs.
14881 ".writes_depth" boolean The shader writes out a depth value.
14882 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
14884 ".uses_prim_id" boolean The shader uses PrimID.
14885 ========================== ============== ========= ===============================================================
14889 .. table:: AMDPAL Code Object Shader Function Map
14890 :name: amdgpu-amdpal-code-object-shader-function-map-table
14892 =============== ============== ====================================================================
14893 String Key Value Type Description
14894 =============== ============== ====================================================================
14895 *symbol name* map *symbol name* is the ELF symbol name of the shader function code
14896 entry address. The value is the function's metadata. See
14897 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14898 =============== ============== ====================================================================
14902 .. table:: AMDPAL Code Object Shader Function Metadata Map
14903 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14905 ============================= ============== =================================================================
14906 String Key Value Type Description
14907 ============================= ============== =================================================================
14908 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
14909 2 integers is implementation defined, and can not be relied on between
14910 different builds of the compiler.
14911 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
14912 ".lds_size" integer Size in bytes of LDS memory.
14913 ".vgpr_count" integer Number of VGPRs used by the shader.
14914 ".sgpr_count" integer Number of SGPRs used by the shader.
14915 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
14916 ".shader_subtype" string Shader subtype/kind. Values include:
14920 ============================= ============== =================================================================
14924 .. table:: AMDPAL Code Object Register Map
14925 :name: amdgpu-amdpal-code-object-register-map-table
14927 ========================== ============== ====================================================================
14928 32-bit Integer Key Value Type Description
14929 ========================== ============== ====================================================================
14930 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14931 a GRBM register (i.e., driver accessible GPU register number, not
14932 shader GPR register number). The driver is required to program each
14933 specified register to the corresponding specified value when
14934 executing this pipeline. Typically, the ``reg offsets`` are the
14935 ``uint16_t`` offsets to each register as defined by the hardware
14936 chip headers. The register is set to the provided value. However, a
14937 ``reg offset`` that specifies a user data register (e.g.,
14938 COMPUTE_USER_DATA_0) needs special treatment. See
14939 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14941 ========================== ============== ====================================================================
14943 .. _amdgpu-amdpal-code-object-user-data-section:
14948 Each hardware stage has a set of 32-bit physical SPI *user data registers*
14949 (either 16 or 32 based on graphics IP and the stage) which can be
14950 written from a command buffer and then loaded into SGPRs when waves are
14951 launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware shader.
14955 PAL abstracts this functionality by exposing a set of 128 *user data
14956 entries* per pipeline a client can use to pass arguments from a command
14957 buffer to one or more shaders in that pipeline. The ELF code object must
14958 specify a mapping from virtualized *user data entries* to physical *user
14959 data registers*, and PAL is responsible for implementing that mapping,
14960 including spilling overflow *user data entries* to memory if needed.
14962 Since the *user data registers* are GRBM-accessible SPI registers, this
14963 mapping is actually embedded in the ``.registers`` metadata entry. For
14964 most registers, the value in that map is a literal 32-bit value that
14965 should be written to the register by the driver. However, when the
14966 register is a *user data register* (any USER_DATA register e.g.,
14967 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14968 the driver to write either a *user data entry* value or one of several
14969 driver-internal values to the register. This encoding is described in
14970 the following table:
14974 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14975 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14976 always be programmed to the address of the GlobalTable, and *user data
14977 register* 1 must always be programmed to the address of the PerShaderTable.
14981 .. table:: AMDPAL User Data Mapping
14982 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14984 ========== ================= ===============================================================================
14985 Value Name Description
14986 ========== ================= ===============================================================================
14987 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14988 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
14989 always point to *user data register* 0).
14990 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
14991 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14992 for more detail (should always point to *user data register* 1).
14993 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
14994 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14996 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14997 reference the draw index in the vertex shader. Only supported by the first
14998 stage in a graphics pipeline.
14999 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
15000 a graphics pipeline.
15001 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a
15003 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
15004 a buffer containing the grid dimensions for a Compute dispatch operation. The
15005 high half of the address is stored in the next sequential user-SGPR. Only
15006 supported by compute pipelines.
15007 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
space used for the ES/GS pseudo-ring-buffer for passing data between shader stages.
15010 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
15011 pipeline instancing.
15012 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
15013 can only appear for one shader stage per pipeline.
15014 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
15015 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
15016 only appear for one shader stage per pipeline.
15017 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
15018 only appear for one shader stage per pipeline (PS). These replace color targets
15019 and are completely separate from any UAVs used by the shader. This is optional,
15020 and only used by the PS when UAV exports are used to replace color-target
15021 exports to optimize specific shaders.
15022 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
15023 some NGG pipelines to perform culling. This value contains the address of the
15024 first of two consecutive registers which provide the full GPU address.
15025 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
15026 ========== ================= ===============================================================================
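
The encoding in the preceding table can be summarized with a small sketch (an
illustration only, not PAL's implementation). The helper name below is made up
for this example, and the constants mirror the table:

.. code-block:: python

  # Special values from the "AMDPAL User Data Mapping" table above.
  SPECIAL_USER_DATA = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }

  def describe_user_data_value(value: int) -> str:
      # Values 0..127 select a virtualized user data entry; everything else is
      # one of the driver-internal encodings listed above.
      if 0 <= value <= 127:
          return f"user_data_entry[{value}]"
      return SPECIAL_USER_DATA.get(value, f"unknown encoding {value:#x}")

  print(describe_user_data_value(5))           # user_data_entry[5]
  print(describe_user_data_value(0x10000002))  # SpillTable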
15028 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
15033 Low 32 bits of the GPU address for an optional buffer in the ``.data``
15034 section of the ELF. The high 32 bits of the address match the high 32 bits
15035 of the shader's program counter.
15037 The buffer can be anything the shader compiler needs it for, and
15038 allows each shader to have its own region of the ``.data`` section.
15039 Typically, this could be a table of buffer SRD's and the data pointed to
15040 by the buffer SRD's, but it could be a flat-address region of memory as
15041 well. Its layout and usage are defined by the shader compiler.
15043 Each shader's table in the ``.data`` section is referenced by the symbol
15044 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
15045 hardware shader stage the data is for. E.g.,
15046 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
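
The address reconstruction described above can be sketched as follows
(illustrative only; the helper name and the example values are hypothetical):

.. code-block:: python

  def per_shader_table_address(low32: int, program_counter: int) -> int:
      # Combine the low 32 bits taken from the user data register with the
      # high 32 bits of the shader's program counter to form the full
      # 64-bit address of the per-shader table.
      return (program_counter & 0xFFFFFFFF00000000) | (low32 & 0xFFFFFFFF)

  # Hypothetical values for illustration only.
  print(hex(per_shader_table_address(0x0001F000, 0x00007F1200400000)))
  # -> 0x7f120001f000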
15048 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
15053 It is possible for a hardware shader to need access to more *user data
15054 entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory and one
user data register to be used to point to the spilled user data memory. The
15058 value of the *user data entry* must then represent the location where
15059 a shader expects to read the low 32-bits of the table's GPU virtual
15060 address. The *spill table* itself represents a set of 32-bit values
15061 managed by the PAL runtime in GPU-accessible memory that can be made
15062 indirectly accessible to a hardware shader.
15067 This section provides code conventions used when the target triple OS is
15068 empty (see :ref:`amdgpu-target-triples`).
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime does
15074 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
15075 instructions are handled as follows:
15077 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
15078 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
15080 =============== =============== ===========================================
15081 Usage Code Sequence Description
15082 =============== =============== ===========================================
15083 llvm.trap s_endpgm Causes wavefront to be terminated.
15084 llvm.debugtrap *none* Compiler warning given that there is no
15085 trap handler installed.
15086 =============== =============== ===========================================
15096 When the language is OpenCL the following differences occur:
15098 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
15099 2. The AMDGPU backend appends additional arguments to the kernel's explicit
15100 arguments for the AMDHSA OS (see
15101 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
15102 3. Additional metadata is generated
15103 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
15105 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
15106 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
15108 ======== ==== ========= ===========================================
Position Byte Byte      Description
         Size Alignment
15111 ======== ==== ========= ===========================================
15112 1 8 8 OpenCL Global Offset X
15113 2 8 8 OpenCL Global Offset Y
15114 3 8 8 OpenCL Global Offset Z
15115 4 8 8 OpenCL address of printf buffer
5 8 8 OpenCL address of virtual queue used by enqueue_kernel.
6 8 8 OpenCL address of AqlWrap struct used by enqueue_kernel.
7 8 8 Pointer argument used for Multi-grid synchronization.
15122 ======== ==== ========= ===========================================
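
As a rough sketch (not part of the specification), the byte offsets of these
appended arguments follow directly from the size of the explicit arguments that
precede them, since each implicit argument is 8 bytes with 8-byte alignment.
The argument names below are descriptive only.

.. code-block:: python

  # Appended OpenCL implicit arguments, in table order (8 bytes each).
  IMPLICIT_ARGS = [
      "global_offset_x",
      "global_offset_y",
      "global_offset_z",
      "printf_buffer",
      "virtual_queue",
      "aql_wrap",
      "multi_grid_sync",
  ]

  def implicit_arg_offsets(explicit_kernarg_size: int) -> dict:
      # Assume the implicit arguments start at the next 8-byte aligned offset
      # after the explicit kernel arguments.
      offset = (explicit_kernarg_size + 7) & ~7
      return {name: offset + 8 * i for i, name in enumerate(IMPLICIT_ARGS)}

  print(implicit_arg_offsets(20))
  # {'global_offset_x': 24, 'global_offset_y': 32, 'global_offset_z': 40, ...}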
15129 When the language is HCC the following differences occur:
15131 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
15133 .. _amdgpu-assembler:
The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX11.
15141 This section describes general syntax for instructions and operands.
15146 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
15148 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
15149 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
15151 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
15152 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
15154 The order of operands and modifiers is fixed.
15155 Most modifiers are optional and may be omitted.
15157 Links to detailed instruction syntax description may be found in the following
15158 table. Note that features under development are not included
15159 in this description.
15161 ============= ============================================= =======================================
15162 Architecture Core ISA ISA Variants and Extensions
15163 ============= ============================================= =======================================
15164 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \-
15165 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \-
15166 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
15168 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
15170 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
15172 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
15174 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
15176 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
15178 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
15180 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
15182 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
15184 :doc:`gfx941<AMDGPU/AMDGPUAsmGFX940>`
15186 :doc:`gfx942<AMDGPU/AMDGPUAsmGFX940>`
15188 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
15190 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
15192 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
15194 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
15196 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
15198 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
15200 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
15202 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
15204 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
15206 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
15208 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
15210 RDNA 3 :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>` :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
15212 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
15214 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
15216 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
15217 ============= ============================================= =======================================
For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals
15221 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
15222 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
15223 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_,
15224 [AMD-GCN-GFX940-GFX942-CDNA3]_, [AMD-GCN-GFX10-RDNA1]_, [AMD-GCN-GFX10-RDNA2]_
15225 and [AMD-GCN-GFX11-RDNA3]_.
15230 Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
15235 Detailed description of modifiers may be found
15236 :doc:`here<AMDGPUModifierSyntax>`.
15238 Instruction Examples
15239 ~~~~~~~~~~~~~~~~~~~~
15244 .. code-block:: nasm
15246 ds_add_u32 v2, v4 offset:16
15247 ds_write_src2_b64 v2 offset0:4 offset1:8
15248 ds_cmpst_f32 v2, v4, v6
15249 ds_min_rtn_f64 v[8:9], v2, v[4:5]
For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.
15257 .. code-block:: nasm
15259 flat_load_dword v1, v[3:4]
15260 flat_store_dwordx3 v[3:4], v[5:7]
15261 flat_atomic_swap v1, v[3:4], v5 glc
15262 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
15263 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
For a full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.
15271 .. code-block:: nasm
15273 buffer_load_dword v1, off, s[4:7], s1
15274 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
15275 buffer_store_format_xy v[1:2], off, s[4:7], s1
15277 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.
15285 .. code-block:: nasm
15287 s_load_dword s1, s[2:3], 0xfc
15288 s_load_dwordx8 s[8:15], s[2:3], s4
15289 s_load_dwordx16 s[88:103], s[2:3], s4
For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.
15299 .. code-block:: nasm
15302 s_mov_b64 s[0:1], 0x80000000
15304 s_wqm_b64 s[2:3], s[4:5]
15305 s_bcnt0_i32_b64 s1, s[2:3]
15306 s_swappc_b64 s[2:3], s[4:5]
15307 s_cbranch_join s[4:5]
For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.
15315 .. code-block:: nasm
15317 s_add_u32 s1, s2, s3
15318 s_and_b64 s[2:3], s[4:5], s[6:7]
15319 s_cselect_b32 s1, s2, s3
15320 s_andn2_b32 s2, s4, s6
15321 s_lshr_b64 s[2:3], s[4:5], s6
15322 s_ashr_i32 s2, s4, s6
15323 s_bfm_b64 s[2:3], s4, s6
15324 s_bfe_i64 s[2:3], s[4:5], s6
15325 s_cbranch_g_fork s[4:5], s[6:7]
For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.
15333 .. code-block:: nasm
15335 s_cmp_eq_i32 s1, s2
15336 s_bitcmp1_b32 s1, s2
15337 s_bitcmp0_b64 s[2:3], s4
For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.
15346 .. code-block:: nasm
15351 s_waitcnt 0 ; Wait for all counters to be 0
15352 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
15353 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
15357 s_sendmsg sendmsg(MSG_INTERRUPT)
For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.
Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on its operands.
To force a specific encoding, one can add a suffix to the opcode of the
instruction:
15374 * _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _e64_dpp for VOP3 with DPP
15378 * _sdwa for VOP_SDWA
15380 VOP1/VOP2/VOP3/VOPC examples:
15382 .. code-block:: nasm
15385 v_mov_b32_e32 v1, v2
15387 v_cvt_f64_i32_e32 v[1:2], v2
15388 v_floor_f32_e32 v1, v2
15389 v_bfrev_b32_e32 v1, v2
15390 v_add_f32_e32 v1, v2, v3
15391 v_mul_i32_i24_e64 v1, v2, 3
15392 v_mul_i32_i24_e32 v1, -3, v3
15393 v_mul_i32_i24_e32 v1, -100, v3
15394 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
15395 v_max_f16_e32 v1, v2, v3
15399 .. code-block:: nasm
15401 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
15402 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15403 v_mov_b32 v0, v0 wave_shl:1
15404 v_mov_b32 v0, v0 row_mirror
15405 v_mov_b32 v0, v0 row_bcast:31
15406 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
15407 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15408 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15411 VOP3_DPP examples (Available on GFX11+):
15413 .. code-block:: nasm
15415 v_add_f32_e64_dpp v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
15416 v_sqrt_f32_e64_dpp v0, v1 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15417 v_ldexp_f32 v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
15421 .. code-block:: nasm
15423 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
15424 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
15425 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
15426 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
15427 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
For a full list of supported instructions, refer to "Vector ALU instructions".
15431 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
15433 Code Object V2 Predefined Symbols
15434 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15437 Code object V2 generation is no longer supported by this version of LLVM.
15439 The AMDGPU assembler defines and updates some symbols automatically. These
15440 symbols do not affect code generation.
15442 .option.machine_version_major
15443 +++++++++++++++++++++++++++++
15445 Set to the GFX major generation number of the target being assembled for. For
15446 example, when assembling for a "GFX9" target this will be set to the integer
15447 value "9". The possible GFX major generation numbers are presented in
15448 :ref:`amdgpu-processors`.
15450 .option.machine_version_minor
15451 +++++++++++++++++++++++++++++
15453 Set to the GFX minor generation number of the target being assembled for. For
15454 example, when assembling for a "GFX810" target this will be set to the integer
15455 value "1". The possible GFX minor generation numbers are presented in
15456 :ref:`amdgpu-processors`.
15458 .option.machine_version_stepping
15459 ++++++++++++++++++++++++++++++++
15461 Set to the GFX stepping generation number of the target being assembled for.
15462 For example, when assembling for a "GFX704" target this will be set to the
15463 integer value "4". The possible GFX stepping generation numbers are presented
15464 in :ref:`amdgpu-processors`.
15469 Set to zero each time a
15470 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15471 encountered. At each instruction, if the current value of this symbol is less
15472 than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus one.
15479 Set to zero each time a
15480 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15481 encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus one.
15486 .. _amdgpu-amdhsa-assembler-directives-v2:
15488 Code Object V2 Directives
15489 ~~~~~~~~~~~~~~~~~~~~~~~~~
15492 Code object V2 generation is no longer supported by this version of LLVM.
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, one can specify this data with assembler directives.
15497 .hsa_code_object_version major, minor
15498 +++++++++++++++++++++++++++++++++++++
15500 *major* and *minor* are integers that specify the version of the HSA code
15501 object that will be generated by the assembler.
15503 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
15504 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15507 *major*, *minor*, and *stepping* are all integers that describe the instruction
15508 set architecture (ISA) version of the assembly program.
15510 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
15511 "AMD" and *arch* should always be equal to "AMDGPU".
15513 By default, the assembler will derive the ISA version, *vendor*, and *arch*
15514 from the value of the -mcpu option that is passed to the assembler.
15516 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
15518 .amdgpu_hsa_kernel (name)
15519 +++++++++++++++++++++++++
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
15528 This directive marks the beginning of a list of key / value pairs that are used
15529 to specify the amd_kernel_code_t object that will be emitted by the assembler.
15530 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
15531 amd_kernel_code_t values that are unspecified a default value will be used. The
15532 default value for all keys is 0, with the following exceptions:
15534 - *amd_code_version_major* defaults to 1.
- *amd_code_version_minor* defaults to 2.
15536 - *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
15540 - *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
  it defaults to 6 if the target feature ``wavefrontsize64`` is enabled, otherwise
  5. Note that the wavefront size is specified as a power of two, so a value of
  **n** means a size of 2^ **n** (a small decoding sketch is given below).
15545 - *call_convention* defaults to -1.
15546 - *kernarg_segment_alignment*, *group_segment_alignment*, and
15547 *private_segment_alignment* default to 4. Note that alignments are specified
15548 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
15553 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
15555 The *.amd_kernel_code_t* directive must be placed immediately after the
15556 function label and before any instructions.
15558 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
15559 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
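
Several of the defaults above are log2-encoded. The following sketch
(illustrative only) shows how such values decode:

.. code-block:: python

  def decode_log2(value: int) -> int:
      # A log2-encoded amd_kernel_code_t field with value n means 2**n.
      return 1 << value

  # wavefront_size = 6 encodes a 64-lane wavefront; 5 encodes 32 lanes.
  print(decode_log2(6), decode_log2(5))  # 64 32
  # The segment alignments default to 4, i.e. 16-byte alignment.
  print(decode_log2(4))  # 16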
15561 .. _amdgpu-amdhsa-assembler-example-v2:
15563 Code Object V2 Example Source Code
15564 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15567 Code object V2 generation is no longer supported by this version of LLVM.
15569 Here is an example of a minimal assembly source file, defining one HSA kernel:
15574 .hsa_code_object_version 1,0
15575 .hsa_code_object_isa
15580 .amdgpu_hsa_kernel hello_world
15585 enable_sgpr_kernarg_segment_ptr = 1
15587 compute_pgm_rsrc1_vgprs = 0
15588 compute_pgm_rsrc1_sgprs = 0
15589 compute_pgm_rsrc2_user_sgpr = 2
15590 compute_pgm_rsrc1_wgp_mode = 0
15591 compute_pgm_rsrc1_mem_ordered = 0
15592 compute_pgm_rsrc1_fwd_progress = 1
15593 .end_amd_kernel_code_t
15595 s_load_dwordx2 s[0:1], s[0:1] 0x0
15596 v_mov_b32 v0, 3.14159
15597 s_waitcnt lgkmcnt(0)
15600 flat_store_dword v[1:2], v0
15603 .size hello_world, .Lfunc_end0-hello_world
15605 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
15607 Code Object V3 and Above Predefined Symbols
15608 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15610 The AMDGPU assembler defines and updates some symbols automatically. These
15611 symbols do not affect code generation.
15613 .amdgcn.gfx_generation_number
15614 +++++++++++++++++++++++++++++
15616 Set to the GFX major generation number of the target being assembled for. For
15617 example, when assembling for a "GFX9" target this will be set to the integer
15618 value "9". The possible GFX major generation numbers are presented in
15619 :ref:`amdgpu-processors`.
15621 .amdgcn.gfx_generation_minor
15622 ++++++++++++++++++++++++++++
15624 Set to the GFX minor generation number of the target being assembled for. For
15625 example, when assembling for a "GFX810" target this will be set to the integer
15626 value "1". The possible GFX minor generation numbers are presented in
15627 :ref:`amdgpu-processors`.
15629 .amdgcn.gfx_generation_stepping
15630 +++++++++++++++++++++++++++++++
15632 Set to the GFX stepping generation number of the target being assembled for.
15633 For example, when assembling for a "GFX704" target this will be set to the
15634 integer value "4". The possible GFX stepping generation numbers are presented
15635 in :ref:`amdgpu-processors`.
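
As an informal illustration of how a processor name decomposes into these three
numbers (the assembler defines the symbols itself), the sketch below follows
the examples above and treats the trailing digits as hexadecimal so that names
like ``gfx90a`` are covered:

.. code-block:: python

  def gfx_version(processor: str) -> tuple:
      # Split a processor name such as "gfx908" or "gfx90a" into
      # (major, minor, stepping). Illustrative only.
      digits = processor.lower().removeprefix("gfx")
      major = int(digits[:-2])
      minor = int(digits[-2], 16)
      stepping = int(digits[-1], 16)
      return major, minor, stepping

  print(gfx_version("gfx908"))   # (9, 0, 8)
  print(gfx_version("gfx1010"))  # (10, 1, 0)
  print(gfx_version("gfx90a"))   # (9, 0, 10)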
15637 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
15639 .amdgcn.next_free_vgpr
15640 ++++++++++++++++++++++
15642 Set to zero before assembly begins. At each instruction, if the current value
15643 of this symbol is less than or equal to the maximum VGPR number explicitly
15644 referenced within that instruction then the symbol value is updated to equal
15645 that VGPR number plus one.
15647 May be used to set the `.amdhsa_next_free_vgpr` directive in
15648 :ref:`amdhsa-kernel-directives-table`.
15650 May be set at any time, e.g. manually set to zero at the start of each kernel.
15652 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
15654 .amdgcn.next_free_sgpr
15655 ++++++++++++++++++++++
Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the `.amdhsa_next_free_sgpr` directive in
:ref:`amdhsa-kernel-directives-table`.
15665 May be set at any time, e.g. manually set to zero at the start of each kernel.
15667 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
15669 Code Object V3 and Above Directives
15670 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15672 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
15673 architecture processors, and are not OS-specific. Directives which begin with
15674 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
15675 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
15676 :ref:`amdgpu-processors`.
15678 .. _amdgpu-assembler-directive-amdgcn-target:
15680 .amdgcn_target <target-triple> "-" <target-id>
15681 ++++++++++++++++++++++++++++++++++++++++++++++
15683 Optional directive which declares the ``<target-triple>-<target-id>`` supported
15684 by the containing assembler source file. Used by the assembler to validate
15685 command-line options such as ``-triple``, ``-mcpu``, and
15686 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
15687 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
15691 The target ID syntax used for code object V2 to V3 for this directive differs
15692 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
15694 .. _amdgpu-assembler-directive-amdhsa-code-object-version:
15696 .amdhsa_code_object_version <version>
15697 +++++++++++++++++++++++++++++++++++++
15699 Optional directive which declares the code object version to be generated by the
15700 assembler. If not present, a default value will be used.
15702 .amdhsa_kernel <name>
15703 +++++++++++++++++++++
15705 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
15706 ``<name>.kd``, in the current location of the current section. Only valid when
15707 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
15708 instruction to execute, and does not need to be previously defined.
15710 Marks the beginning of a list of directives used to generate the bytes of a
15711 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
15712 Directives which may appear in this list are described in
15713 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
15714 be valid for the target being assembled for, and cannot be repeated. Directives
15715 support the range of values specified by the field they reference in
15716 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
15717 assumed to have its default value, unless it is marked as "Required", in which
15718 case it is an error to omit the directive. This list of directives is
15719 terminated by an ``.end_amdhsa_kernel`` directive.
15721 .. table:: AMDHSA Kernel Assembler Directives
15722 :name: amdhsa-kernel-directives-table
15724 ======================================================== =================== ============ ===================
15725 Directive Default Supported On Description
15726 ======================================================== =================== ============ ===================
15727 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX12 Controls GROUP_SEGMENT_FIXED_SIZE in
15728 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15729 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX12 Controls PRIVATE_SEGMENT_FIXED_SIZE in
15730 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15731 ``.amdhsa_kernarg_size`` 0 GFX6-GFX12 Controls KERNARG_SIZE in
15732 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15733 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX12 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
15734 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`
15735 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
15736 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15738 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_DISPATCH_PTR in
15739 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15740 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_QUEUE_PTR in
15741 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15742 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
15743 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15744 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX12 Controls ENABLE_SGPR_DISPATCH_ID in
15745 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15746 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
15747 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15749 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX12 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
15750 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15751 ``.amdhsa_wavefront_size32`` Target GFX10-GFX12 Controls ENABLE_WAVEFRONT_SIZE32 in
15752 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15755 ``.amdhsa_uses_dynamic_stack`` 0 GFX6-GFX12 Controls USES_DYNAMIC_STACK in
15756 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15757 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
15758 (except :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15760 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in
15761 GFX11-GFX12 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15762 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_X in
15763 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15764 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
15765 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15766 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
15767 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15768 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_INFO in
15769 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15770 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX12 Controls ENABLE_VGPR_WORKITEM_ID in
15771 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15772 Possible values are defined in
15773 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
15774 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX12 Maximum VGPR number explicitly referenced, plus one.
15775 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
15776 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15777 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX12 Maximum SGPR number explicitly referenced, plus one.
15778 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15779 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
``.amdhsa_accum_offset`` Required GFX90A, Offset of the first AccVGPR in the unified register file.
15781 GFX940 Used to calculate ACCUM_OFFSET in
15782 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15783 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX12 Whether the kernel may use the special VCC SGPR.
15784 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15785 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15786 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
15787 (except scratch memory. Used to calculate
15788 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in
15789 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15790 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
15791 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15792 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15794 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX12 Controls FLOAT_ROUND_MODE_32 in
15795 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15796 Possible values are defined in
15797 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15798 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX12 Controls FLOAT_ROUND_MODE_16_64 in
15799 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15800 Possible values are defined in
15801 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15802 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX12 Controls FLOAT_DENORM_MODE_32 in
15803 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15804 Possible values are defined in
15805 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15806 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX12 Controls FLOAT_DENORM_MODE_16_64 in
15807 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15808 Possible values are defined in
15809 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15810 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in
15811 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15812 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in
15813 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15814 ``.amdhsa_round_robin_scheduling`` 0 GFX12 Controls ENABLE_WG_RR_EN in
15815 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15816 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX12 Controls FP16_OVFL in
15817 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15818 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in
15819 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15820 Specific GFX11-GFX12
15822 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX12 Controls ENABLE_WGP_MODE in
15823 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15826 ``.amdhsa_memory_ordered`` 1 GFX10-GFX12 Controls MEM_ORDERED in
15827 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15828 ``.amdhsa_forward_progress`` 0 GFX10-GFX12 Controls FWD_PROGRESS in
15829 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15830 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in
15831 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
15832 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
15833 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15834 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
15835 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15836 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
15837 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15838 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
15839 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15840 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
15841 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15842 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
15843 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15844 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
15845 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15846 ``.amdhsa_user_sgpr_kernarg_preload_length`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_LENGTH in
15847 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15848 ``.amdhsa_user_sgpr_kernarg_preload_offset`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_OFFSET in
15849 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15850 ======================================================== =================== ============ ===================
15855 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15856 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15858 The contents must be in the [YAML]_ markup format, with the same structure and
15859 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15860 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15861 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15863 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
15865 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15867 Code Object V3 and Above Example Source Code
15868 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15870 Here is an example of a minimal assembly source file, defining one HSA kernel:
15875 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15880 .type hello_world,@function
15882 s_load_dwordx2 s[0:1], s[0:1] 0x0
15883 v_mov_b32 v0, 3.14159
15884 s_waitcnt lgkmcnt(0)
15887 flat_store_dword v[1:2], v0
15890 .size hello_world, .Lfunc_end0-hello_world
15894 .amdhsa_kernel hello_world
15895 .amdhsa_user_sgpr_kernarg_segment_ptr 1
15896 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15897 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15906 - .name: hello_world
15907 .symbol: hello_world.kd
15908 .kernarg_segment_size: 48
15909 .group_segment_fixed_size: 0
15910 .private_segment_fixed_size: 0
15911 .kernarg_segment_align: 4
15912 .wavefront_size: 64
15915 .max_flat_workgroup_size: 256
15919 .value_kind: global_buffer
15920 .address_space: global
15921 .actual_access: write_only
15923 .end_amdgpu_metadata
15925 This kernel is equivalent to the following HIP program:
__global__ void hello_world(float *p) {
    *p = 3.14159f;
}
15934 If an assembly source file contains multiple kernels and/or functions, the
15935 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15936 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15937 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
15938 kernels, where ``function1`` is only called from ``kernel1`` it is sufficient
15939 to group the function with the kernel that calls it and reset the symbols
15940 between the two connected components:
15945 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15947 // gpr tracking symbols are implicitly set to zero
15952 .type kern0,@function
15957 .size kern0, .Lkern0_end-kern0
15961 .amdhsa_kernel kern0
15963 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15964 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15967 // reset symbols to begin tracking usage in func1 and kern1
15968 .set .amdgcn.next_free_vgpr, 0
15969 .set .amdgcn.next_free_sgpr, 0
15975 .type func1,@function
15978 s_setpc_b64 s[30:31]
15980 .size func1, .Lfunc1_end-func1
15984 .type kern1,@function
15988 s_add_u32 s4, s4, func1@rel32@lo+4
15989 s_addc_u32 s5, s5, func1@rel32@lo+4
15990 s_swappc_b64 s[30:31], s[4:5]
15994 .size kern1, .Lkern1_end-kern1
15998 .amdhsa_kernel kern1
16000 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
16001 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
16004 These symbols cannot identify connected components in order to automatically
16005 track the usage for each kernel. However, in some cases careful organization of
16006 the kernels and functions in the source file means there is minimal additional
16007 effort required to accurately calculate GPR usage.
16009 Additional Documentation
16010 ========================
16012 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
16014 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
16015 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
16016 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
16017 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
16018 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
16019 .. [AMD-GCN-GFX940-GFX942-CDNA3] `AMD Instinct MI300 Instruction Set Architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`__
16020 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
16021 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
16022 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
16023 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
16024 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
16025 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
16026 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
16027 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
16028 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
16029 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
16030 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
16031 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
16032 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
16033 .. [HRF] `Heterogeneous-race-free Memory Models <https://research.cs.wisc.edu/multifacet/papers/asplos14_hrf.pdf>`__
16034 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
16035 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
16036 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
16037 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
16038 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__