1 =============================
2 User Guide for AMDGPU Backend
3 =============================
14 AMDGPU/AMDGPUAsmGFX900
15 AMDGPU/AMDGPUAsmGFX904
16 AMDGPU/AMDGPUAsmGFX906
17 AMDGPU/AMDGPUAsmGFX908
18 AMDGPU/AMDGPUAsmGFX90a
19 AMDGPU/AMDGPUAsmGFX940
21 AMDGPU/AMDGPUAsmGFX1011
22 AMDGPU/AMDGPUAsmGFX1013
23 AMDGPU/AMDGPUAsmGFX1030
27 AMDGPUInstructionSyntax
28 AMDGPUInstructionNotation
29 AMDGPUDwarfExtensionsForHeterogeneousDebugging
30 AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
35 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
36 R600 family up until the current GCN families. It lives in the
37 ``llvm/lib/Target/AMDGPU`` directory.
42 .. _amdgpu-target-triples:
47 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
48 to specify the target triple:
50 .. table:: AMDGPU Architectures
51 :name: amdgpu-architecture-table
53 ============ ==============================================================
54 Architecture Description
55 ============ ==============================================================
56 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
57 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
58 ============ ==============================================================
60 .. table:: AMDGPU Vendors
61 :name: amdgpu-vendor-table
============ ==============================================================
Vendor       Description
============ ==============================================================
66 ``amd`` Can be used for all AMD GPU usage.
67 ``mesa3d`` Can be used if the OS is ``mesa3d``.
68 ============ ==============================================================
70 .. table:: AMDGPU Operating Systems
============== ============================================================
OS             Description
============== ============================================================
76 *<empty>* Defaults to the *unknown* OS.
77 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
80 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
81 loader on Linux. See *AMD ROCm Platform Release Notes*
82 [AMD-ROCm-Release-Notes]_ for supported hardware and
84 - AMD's PAL runtime using the *pal-amdhsa* loader on
87 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
88 runtime using the *pal-amdpal* loader on Windows and Linux
90 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
91 3D runtime using the *mesa-mesa3d* loader on Linux.
92 ============== ============================================================
94 .. table:: AMDGPU Environments
95 :name: amdgpu-environment-table
97 ============ ==============================================================
98 Environment Description
99 ============ ==============================================================
101 ============ ==============================================================
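For example, a minimal LLVM IR module targeting the HSA runtime can spell the
full triple as follows (a sketch; the processor itself is selected separately
with ``-mcpu``, see :ref:`amdgpu-processors`):

.. code-block:: llvm

  ; amdgcn architecture, amd vendor, amdhsa OS, no environment component.
  target triple = "amdgcn-amd-amdhsa"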
103 .. _amdgpu-processors:
108 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
109 specify the AMDGPU processor together with optional target features. See
110 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
111 specific information.
113 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
* ``amdhsa`` is not supported by the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
118 .. table:: AMDGPU Processors
119 :name: amdgpu-processor-table
121 =========== =============== ============ ===== ================= =============== =============== ======================
122 Processor Alternative Target dGPU/ Target Target OS Support Example
123 Processor Triple APU Features Properties *(see* Products
124 Architecture Supported `amdgpu-os`_
133 =========== =============== ============ ===== ================= =============== =============== ======================
134 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
135 -----------------------------------------------------------------------------------------------------------------------
136 ``r600`` ``r600`` dGPU - Does not
141 ``r630`` ``r600`` dGPU - Does not
146 ``rs880`` ``r600`` dGPU - Does not
151 ``rv670`` ``r600`` dGPU - Does not
156 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
157 -----------------------------------------------------------------------------------------------------------------------
158 ``rv710`` ``r600`` dGPU - Does not
163 ``rv730`` ``r600`` dGPU - Does not
168 ``rv770`` ``r600`` dGPU - Does not
173 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
174 -----------------------------------------------------------------------------------------------------------------------
175 ``cedar`` ``r600`` dGPU - Does not
180 ``cypress`` ``r600`` dGPU - Does not
185 ``juniper`` ``r600`` dGPU - Does not
190 ``redwood`` ``r600`` dGPU - Does not
195 ``sumo`` ``r600`` dGPU - Does not
200 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
201 -----------------------------------------------------------------------------------------------------------------------
202 ``barts`` ``r600`` dGPU - Does not
207 ``caicos`` ``r600`` dGPU - Does not
212 ``cayman`` ``r600`` dGPU - Does not
217 ``turks`` ``r600`` dGPU - Does not
222 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
223 -----------------------------------------------------------------------------------------------------------------------
224 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
229 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
234 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
239 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
240 -----------------------------------------------------------------------------------------------------------------------
241 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
242 flat - *pal-amdhsa* - A6 Pro-7050B
243 scratch - *pal-amdpal* - A8-7100
251 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
252 flat - *pal-amdhsa* - FirePro W9100
253 scratch - *pal-amdpal* - FirePro S9150
255 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
256 flat - *pal-amdhsa* - Radeon R9 290x
257 scratch - *pal-amdpal* - Radeon R390
259 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
260 - ``mullins`` flat - *pal-amdpal* - E1-2200
268 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
269 flat - *pal-amdpal* - Radeon HD 8770
272 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
279 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
280 -----------------------------------------------------------------------------------------------------------------------
281 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
282 flat - *pal-amdhsa* - Pro A6-8500B
283 scratch - *pal-amdpal* - A8-8600P
299 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
300 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
301 scratch - *pal-amdpal* - Radeon R9 385
302 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
303 - *pal-amdhsa* - Radeon R9 Fury
304 - *pal-amdpal* - Radeon R9 FuryX
307 - Radeon Instinct MI8
308 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
309 flat - *pal-amdhsa* - Radeon RX 480
310 scratch - *pal-amdpal* - Radeon Instinct MI6
311 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
313 scratch - *pal-amdpal*
314 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
315 flat - *pal-amdhsa* - FirePro S7100
316 scratch - *pal-amdpal* - FirePro W7100
319 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
321 scratch - *pal-amdpal* .. TODO::
326 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_ [AMD-GCN-GFX940-GFX942-CDNA3]_
327 -----------------------------------------------------------------------------------------------------------------------
328 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
329 flat - *pal-amdhsa* Frontier Edition
330 scratch - *pal-amdpal* - Radeon RX Vega 56
334 - Radeon Instinct MI25
335 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
336 flat - *pal-amdhsa* - Ryzen 5 2400G
337 scratch - *pal-amdpal*
338 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
340 - *pal-amdpal* .. TODO::
345 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
346 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
347 scratch - *pal-amdpal* - Radeon VII
349 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
353 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
360 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
362 - xnack scratch .. TODO::
363 - kernarg preload - Packed
364 work-item Add product
367 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
368 flat - Ryzen 7 4700GE
369 scratch - Ryzen 5 4600G
381 ``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
383 - xnack scratch .. TODO::
384 - kernarg preload - Packed
385 work-item Add product
388 ``gfx941`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
390 - xnack scratch .. TODO::
391 - kernarg preload - Packed
392 work-item Add product
395 ``gfx942`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
397 - xnack scratch .. TODO::
398 - kernarg preload - Packed
399 work-item Add product
402 **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
403 -----------------------------------------------------------------------------------------------------------------------
404 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
405 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
406 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
408 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
409 - wavefrontsize64 - Absolute - *pal-amdhsa*
410 - xnack flat - *pal-amdpal*
412 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
413 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
414 - xnack scratch - *pal-amdpal*
415 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
416 - wavefrontsize64 flat - *pal-amdhsa*
417 - xnack scratch - *pal-amdpal* .. TODO::
422 **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
423 -----------------------------------------------------------------------------------------------------------------------
424 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
425 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
426 scratch - *pal-amdpal* - Radeon RX 6900 XT
427 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
428 - wavefrontsize64 flat - *pal-amdhsa*
429 scratch - *pal-amdpal*
430 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
431 - wavefrontsize64 flat - *pal-amdhsa*
432 scratch - *pal-amdpal* .. TODO::
437 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
438 - wavefrontsize64 flat
443 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
444 - wavefrontsize64 flat
450 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
451 - wavefrontsize64 flat
456 ``gfx1036`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
457 - wavefrontsize64 flat
463 **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
464 -----------------------------------------------------------------------------------------------------------------------
465 ``gfx1100`` ``amdgcn`` dGPU - cumode - Architected - *pal-amdpal* *TBA*
466 - wavefrontsize64 flat
469 work-item Add product
472 ``gfx1101`` ``amdgcn`` dGPU - cumode - Architected *TBA*
473 - wavefrontsize64 flat
476 work-item Add product
479 ``gfx1102`` ``amdgcn`` dGPU - cumode - Architected *TBA*
480 - wavefrontsize64 flat
483 work-item Add product
486 ``gfx1103`` ``amdgcn`` APU - cumode - Architected *TBA*
487 - wavefrontsize64 flat
490 work-item Add product
493 ``gfx1150`` ``amdgcn`` APU - cumode - Architected *TBA*
494 - wavefrontsize64 flat
497 work-item Add product
500 ``gfx1151`` ``amdgcn`` APU - cumode - Architected *TBA*
501 - wavefrontsize64 flat
504 work-item Add product
507 ``gfx1200`` ``amdgcn`` dGPU - cumode - Architected *TBA*
508 - wavefrontsize64 flat
511 work-item Add product
514 ``gfx1201`` ``amdgcn`` dGPU - cumode - Architected *TBA*
515 - wavefrontsize64 flat
518 work-item Add product
521 =========== =============== ============ ===== ================= =============== =============== ======================
523 .. _amdgpu-target-features:
528 Target features control how code is generated to support certain
529 processor specific features. Not all target features are supported by
530 all processors. The runtime must ensure that the features supported by
531 the device used to execute the code match the features enabled when
532 generating the code. A mismatch of features may result in incorrect
533 execution, or a reduction in performance.
The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
Target features are controlled by exactly one of the following Clang options:
541 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
The ``-mcpu`` and ``--offload-arch`` options can specify the target feature as
544 optional components of the target ID. If omitted, the target feature has the
545 ``any`` value. See :ref:`amdgpu-target-id`.
547 ``-m[no-]<target-feature>``
549 Target features not specified by the target ID are specified using a
550 separate option. These target features can have an ``on`` or ``off``
551 value. ``on`` is specified by omitting the ``no-`` prefix, and
552 ``off`` is specified by including the ``no-`` prefix. The default
553 if not specified is ``off``.
557 ``-mcpu=gfx908:xnack+``
558 Enable the ``xnack`` feature.
559 ``-mcpu=gfx908:xnack-``
560 Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.
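At the LLVM IR level these options typically materialize as the standard
``"target-cpu"`` and ``"target-features"`` function attributes, for example
(a sketch, not verbatim Clang output):

.. code-block:: llvm

  define amdgpu_kernel void @feature_example() #0 {
    ret void
  }

  ; Roughly what -mcpu=gfx908:xnack+ selects for this function.
  attributes #0 = { "target-cpu"="gfx908" "target-features"="+xnack" }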
566 .. table:: AMDGPU Target Features
567 :name: amdgpu-target-features-table
569 =============== ============================ ==================================================
570 Target Feature Clang Option to Control Description
572 =============== ============================ ==================================================
573 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
574 when generating code for kernels. When disabled
575 native WGP wavefront execution mode is used,
576 when enabled CU wavefront execution mode is used
577 (see :ref:`amdgpu-amdhsa-memory-model`).
579 sramecc - ``-mcpu`` If specified, generate code that can only be
580 - ``--offload-arch`` loaded and executed in a process that has a
581 matching setting for SRAMECC.
583 If not specified for code object V2 to V3, generate
584 code that can be loaded and executed in a process
585 with SRAMECC enabled.
587 If not specified for code object V4 or above, generate
588 code that can be loaded and executed in a process
589 with either setting of SRAMECC.
591 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
592 work-groups are launched in threadgroup split mode.
593 When enabled the waves of a work-group may be
594 launched in different CUs.
596 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
597 generating code for kernels. When disabled
598 native wavefront size 32 is used, when enabled
599 wavefront size 64 is used.
601 xnack - ``-mcpu`` If specified, generate code that can only be
602 - ``--offload-arch`` loaded and executed in a process that has a
603 matching setting for XNACK replay.
605 If not specified for code object V2 to V3, generate
606 code that can be loaded and executed in a process
607 with XNACK replay enabled.
609 If not specified for code object V4 or above, generate
610 code that can be loaded and executed in a process
611 with either setting of XNACK replay.
613 XNACK replay can be used for demand paging and
614 page migration. If enabled in the device, then if
615 a page fault occurs the code may execute
616 incorrectly unless generated with XNACK replay
617 enabled, or generated for code object V4 or above without
618 specifying XNACK replay. Executing code that was
619 generated with XNACK replay enabled, or generated
620 for code object V4 or above without specifying XNACK replay,
621 on a device that does not have XNACK replay
622 enabled will execute correctly but may be less
623 performant than code generated for XNACK replay
625 =============== ============================ ==================================================
627 .. _amdgpu-target-id:
632 AMDGPU supports target IDs. See `Clang Offload Bundler
633 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
634 description. The AMDGPU target specific information is:
637 Is an AMDGPU processor or alternative processor name specified in
638 :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
639 the primary processor and alternative processor names. The canonical form
target ID only allows the primary processor name.
643 Is a target feature name specified in :ref:`amdgpu-target-features-table` that
644 is supported by the processor. The target features supported by each processor
are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
646 a target ID are marked as being controlled by ``-mcpu`` and
647 ``--offload-arch``. Each target feature must appear at most once in a target
648 ID. The non-canonical form target ID allows the target features to be
649 specified in any order. The canonical form target ID requires the target
650 features to be specified in alphabetic order.
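For example (assuming the ``gfx90a`` processor, which supports both the
``sramecc`` and ``xnack`` features), the following two target IDs name the same
configuration, but only the second is canonical because its features are listed
in alphabetic order::

  gfx90a:xnack+:sramecc-
  gfx90a:sramecc-:xnack+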
652 .. _amdgpu-target-id-v2-v3:
654 Code Object V2 to V3 Target ID
655 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
657 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
658 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
659 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
660 directive and the bundle entry ID. In those cases it has the following BNF
665 <target-id> ::== <processor> ( "+" <target-feature> )*
667 Where a target feature is omitted if *Off* and present if *On* or *Any*.
The code object V2 to V3 cannot represent *Any* and treats it the same as *On*.
674 .. _amdgpu-embedding-bundled-objects:
676 Embedding Bundled Code Objects
677 ------------------------------
679 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
680 as described in `Clang Offload Bundler
681 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
685 The target ID syntax used for code object V2 to V3 for a bundle entry ID
686 differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
688 .. _amdgpu-address-spaces:
693 The AMDGPU architecture supports a number of memory address spaces. The address
694 space names use the OpenCL standard names, with some additions.
696 The AMDGPU address spaces correspond to target architecture specific LLVM
697 address space numbers used in LLVM IR.
699 The AMDGPU address spaces are described in
700 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
701 supported for the ``amdgcn`` target.
703 .. table:: AMDGPU Address Spaces
704 :name: amdgpu-address-spaces-table
706 ===================================== =============== =========== ================ ======= ============================
707 .. 64-Bit Process Address Space
708 ------------------------------------- --------------- ----------- ---------------- ------------------------------------
709 Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value
710 Space Number Name Name Size
711 ===================================== =============== =========== ================ ======= ============================
712 Generic 0 flat flat 64 0x0000000000000000
713 Global 1 global global 64 0x0000000000000000
714 Region 2 N/A GDS 32 *not implemented for AMDHSA*
715 Local 3 group LDS 32 0xFFFFFFFF
716 Constant 4 constant *same as global* 64 0x0000000000000000
717 Private 5 private scratch 32 0xFFFFFFFF
718 Constant 32-bit 6 *TODO* 0x00000000
719 Buffer Fat Pointer (experimental) 7 *TODO*
720 Buffer Resource (experimental) 8 *TODO*
721 Buffer Strided Pointer (experimental) 9 *TODO*
722 Streamout Registers 128 N/A GS_REGS
723 ===================================== =============== =========== ================ ======= ============================
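The following hypothetical kernel (a sketch, not taken from any runtime) shows
how the LLVM IR address space numbers in the table above appear on pointer
types:

.. code-block:: llvm

  ; 1 = global, 3 = local (LDS), 5 = private (scratch).
  define amdgpu_kernel void @addrspace_example(ptr addrspace(1) %out,
                                               ptr addrspace(3) %lds) {
    %tmp = alloca i32, addrspace(5)           ; private allocation
    %v = load i32, ptr addrspace(3) %lds      ; read from LDS
    store i32 %v, ptr addrspace(5) %tmp       ; spill to scratch
    %v2 = load i32, ptr addrspace(5) %tmp
    store i32 %v2, ptr addrspace(1) %out      ; write to global memory
    ret void
  }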
**Generic**
  The generic address space is supported unless the *Target Properties* column
  of :ref:`amdgpu-processor-table` specifies *Does not support generic address
  space*.
730 The generic address space uses the hardware flat address support for two fixed
731 ranges of virtual addresses (the private and local apertures), that are
732 outside the range of addressable global memory, to map from a flat address to
733 a private or local address. This uses FLAT instructions that can take a flat
734 address and access global, private (scratch), and group (LDS) memory depending
on whether the address is within one of the aperture ranges.
737 Flat access to scratch requires hardware aperture setup and setup in the
738 kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
739 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
740 setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
742 To convert between a private or group address space address (termed a segment
743 address) and a flat address the base address of the corresponding aperture
744 can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
747 GFX9-GFX11 the aperture base addresses are directly available as inline
748 constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
749 In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
750 aligned to 2^32 which makes it easier to convert from flat to segment or
753 A global address space address has the same value when used as a flat address
754 so no conversion is needed.
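For example (a minimal sketch), converting a global or LDS pointer to a generic
(flat) pointer in LLVM IR is expressed with ``addrspacecast``; as described
above, the global case needs no conversion at the hardware level, while the LDS
case uses the local aperture:

.. code-block:: llvm

  define void @to_generic(ptr addrspace(1) %g, ptr addrspace(3) %l) {
    %fg = addrspacecast ptr addrspace(1) %g to ptr    ; same bit pattern
    %fl = addrspacecast ptr addrspace(3) %l to ptr    ; maps into the LDS aperture
    store i32 0, ptr %fg
    store i32 1, ptr %fl
    ret void
  }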
756 **Global and Constant**
757 The global and constant address spaces both use global virtual addresses,
758 which are the same virtual address space used by the CPU. However, some
759 virtual addresses may only be accessible to the CPU, some only accessible
760 by the GPU, and some by both.
762 Using the constant address space indicates that the data will not change
763 during the execution of the kernel. This allows scalar read instructions to
be used. Because the constant address space can only be modified on the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only the device global memory is visible
767 and managed on the host side. The vector and scalar L1 caches are invalidated
768 of volatile data before each kernel dispatch execution to allow constant
769 memory to change values between kernel dispatches.
**Region**
  The region address space uses the hardware Global Data Store (GDS). All
773 wavefronts executing on the same device will access the same memory for any
774 given region address. However, the same region address accessed by wavefronts
775 executing on different devices will access different memory. It is higher
776 performance than global memory. It is allocated by the runtime. The data
777 store (DS) instructions can be used to access it.
**Local**
  The local address space uses the hardware Local Data Store (LDS) which is
781 automatically allocated when the hardware creates the wavefronts of a
782 work-group, and freed when all the wavefronts of a work-group have
783 terminated. All wavefronts belonging to the same work-group will access the
784 same memory for any given local address. However, the same local address
785 accessed by wavefronts belonging to different work-groups will access
786 different memory. It is higher performance than global memory. The data store
787 (DS) instructions can be used to access it.
**Private**
  The private address space uses the hardware scratch memory support which
791 automatically allocates memory when it creates a wavefront and frees it when
a wavefront terminates. The memory accessed by a lane of a wavefront for any
given private address will be different from the memory accessed by another lane
794 of the same or different wavefront for the same private address.
796 If a kernel dispatch uses scratch, then the hardware allocates memory from a
797 pool of backing memory allocated by the runtime for each wavefront. The lanes
798 of the wavefront access this using dword (4 byte) interleaving. The mapping
799 used from private address to backing memory address is:
801 ``wavefront-scratch-base +
802 ((private-address / 4) * wavefront-size * 4) +
803 (wavefront-lane-id * 4) + (private-address % 4)``
805 If each lane of a wavefront accesses the same private address, the
806 interleaving results in adjacent dwords being accessed and hence requires
807 fewer cache lines to be fetched.
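As a worked example (assuming a wavefront size of 64), lane 5 of a wavefront
accessing private address 16 maps to::

  wavefront-scratch-base + ((16 / 4) * 64 * 4) + (5 * 4) + (16 % 4)
    = wavefront-scratch-base + 1024 + 20 + 0
    = wavefront-scratch-base + 1044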
809 There are different ways that the wavefront scratch base address is
810 determined by a wavefront (see
811 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
813 Scratch memory can be accessed in an interleaved manner using buffer
814 instructions with the scratch buffer descriptor and per wavefront scratch
815 offset, by the scratch instructions, or by flat instructions. Multi-dword
816 access is not supported except by flat and scratch instructions in
819 Code that manipulates the stack values in other lanes of a wavefront,
820 such as by ``addrspacecast``-ing stack pointers to generic ones and taking offsets
821 that reach other lanes or by explicitly constructing the scratch buffer descriptor,
822 triggers undefined behavior when it modifies the scratch values of other lanes.
823 The compiler may assume that such modifications do not occur.
824 When using code object V5 ``LIBOMPTARGET_STACK_SIZE`` may be used to provide the
825 private segment size in bytes, for cases where a dynamic stack is used.
830 **Buffer Fat Pointer**
831 The buffer fat pointer is an experimental address space that is currently
832 unsupported in the backend. It exposes a non-integral pointer that is in
833 the future intended to support the modelling of 128-bit buffer descriptors
834 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
835 *pointer*), allowing normal LLVM load/store/atomic operations to be used to
836 model the buffer descriptors used heavily in graphics workloads targeting
839 The buffer descriptor used to construct a buffer fat pointer must be *raw*:
840 the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
841 must be off, and the extent must be measured in bytes. (On subtargets where
842 bounds checking may be disabled, buffer fat pointers may choose to enable
**Buffer Resource**
  The buffer resource pointer, in address space 8, is the newer form
847 for representing buffer descriptors in AMDGPU IR, replacing their
848 previous representation as `<4 x i32>`. It is a non-integral pointer
849 that represents a 128-bit buffer descriptor resource (`V#`).
851 Since, in general, a buffer resource supports complex addressing modes that cannot
852 be easily represented in LLVM (such as implicit swizzled access to structured
853 buffers), it is **illegal** to perform non-trivial address computations, such as
854 ``getelementptr`` operations, on buffer resources. They may be passed to
855 AMDGPU buffer intrinsics, and they may be converted to and from ``i128``.
Casting a buffer resource to a buffer fat pointer is permitted and adds an offset
of 0.
860 Buffer resources can be created from 64-bit pointers (which should be either
861 generic or global) using the `llvm.amdgcn.make.buffer.rsrc` intrinsic, which
862 takes the pointer, which becomes the base of the resource,
the 16-bit stride (and swizzle control) field stored in bits `63:48` of a `V#`,
864 the 32-bit NumRecords/extent field (bits `95:64`), and the 32-bit flags field
865 (bits `127:96`). The specific interpretation of these fields varies by the
866 target architecture and is detailed in the ISA descriptions.
868 **Buffer Strided Pointer**
869 The buffer index pointer is an experimental address space. It represents
870 a 128-bit buffer descriptor and a 32-bit offset, like the **Buffer Fat
871 Pointer**. Additionally, it contains an index into the buffer, which
872 allows the direct addressing of structured elements. These components appear
873 in that order, i.e., the descriptor comes first, then the 32-bit offset
874 followed by the 32-bit index.
876 The bits in the buffer descriptor must meet the following requirements:
877 the stride is the size of a structured element, the "add tid" flag must be 0,
878 and the swizzle enable bits must be off.
880 **Streamout Registers**
881 Dedicated registers used by the GS NGG Streamout Instructions. The register
882 file is modelled as a memory in a distinct address space because it is indexed
883 by an address-like offset in place of named registers, and because register
884 accesses affect LGKMcnt. This is an internal address space used only by the
885 compiler. Do not use this address space for IR pointers.
887 .. _amdgpu-memory-scopes:
892 This section provides LLVM memory synchronization scopes supported by the AMDGPU
893 backend memory model when the target triple OS is ``amdhsa`` (see
894 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
896 The memory model supported is based on the HSA memory model [HSA]_ which is
897 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
898 relation is transitive over the synchronizes-with relation independent of scope
899 and synchronizes-with allows the memory scope instances to be inclusive (see
900 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
902 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
903 inclusion and requires the memory scopes to exactly match. However, this
904 is conservatively correct for OpenCL.
906 .. table:: AMDHSA LLVM Sync Scopes
907 :name: amdgpu-amdhsa-llvm-sync-scopes-table
909 ======================= ===================================================
910 LLVM Sync Scope Description
911 ======================= ===================================================
912 *none* The default: ``system``.
``system``              Synchronizes with, and participates in modification
915 and seq_cst total orderings with, other operations
916 (except image operations) for all address spaces
917 (except private, or generic that accesses private)
918 provided the other operation's sync scope is:
921 - ``agent`` and executed by a thread on the same
923 - ``workgroup`` and executed by a thread in the
925 - ``wavefront`` and executed by a thread in the
928 ``agent`` Synchronizes with, and participates in modification
929 and seq_cst total orderings with, other operations
930 (except image operations) for all address spaces
931 (except private, or generic that accesses private)
932 provided the other operation's sync scope is:
934 - ``system`` or ``agent`` and executed by a thread
936 - ``workgroup`` and executed by a thread in the
938 - ``wavefront`` and executed by a thread in the
941 ``workgroup`` Synchronizes with, and participates in modification
942 and seq_cst total orderings with, other operations
943 (except image operations) for all address spaces
944 (except private, or generic that accesses private)
945 provided the other operation's sync scope is:
947 - ``system``, ``agent`` or ``workgroup`` and
948 executed by a thread in the same work-group.
949 - ``wavefront`` and executed by a thread in the
952 ``wavefront`` Synchronizes with, and participates in modification
953 and seq_cst total orderings with, other operations
954 (except image operations) for all address spaces
955 (except private, or generic that accesses private)
956 provided the other operation's sync scope is:
958 - ``system``, ``agent``, ``workgroup`` or
959 ``wavefront`` and executed by a thread in the
962 ``singlethread`` Only synchronizes with and participates in
963 modification and seq_cst total orderings with,
964 other operations (except image operations) running
965 in the same thread for all address spaces (for
966 example, in signal handlers).
968 ``one-as`` Same as ``system`` but only synchronizes with other
969 operations within the same address space.
971 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
972 operations within the same address space.
974 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
975 other operations within the same address space.
977 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
978 other operations within the same address space.
980 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
981 other operations within the same address space.
982 ======================= ===================================================
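For example (a minimal sketch), these scopes are spelled as ``syncscope``
arguments on LLVM IR atomic operations and fences:

.. code-block:: llvm

  define void @scope_example(ptr addrspace(1) %p) {
    %old = atomicrmw add ptr addrspace(1) %p, i32 1 syncscope("agent") seq_cst
    fence syncscope("workgroup") release
    ret void
  }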
987 The AMDGPU backend implements the following LLVM IR intrinsics.
989 *This section is WIP.*
991 .. table:: AMDGPU LLVM IR Intrinsics
992 :name: amdgpu-llvm-ir-intrinsics-table
994 ============================================== ==========================================================
995 LLVM Intrinsic Description
996 ============================================== ==========================================================
997 llvm.amdgcn.sqrt Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16
998 (on targets with half support). Performs sqrt function.
1000 llvm.amdgcn.log Provides direct access to v_log_f32 and v_log_f16
1001 (on targets with half support). Performs log2 function.
1003 llvm.amdgcn.exp2 Provides direct access to v_exp_f32 and v_exp_f16
1004 (on targets with half support). Performs exp2 function.
1006 :ref:`llvm.frexp <int_frexp>` Implemented for half, float and double.
1008 :ref:`llvm.log2 <int_log2>` Implemented for float and half (and vectors of float or
1009 half). Not implemented for double. Hardware provides
1010 1ULP accuracy for float, and 0.51ULP for half. Float
1011 instruction does not natively support denormal
1014 :ref:`llvm.sqrt <int_sqrt>` Implemented for double, float and half (and vectors).
1016 :ref:`llvm.log <int_log>` Implemented for float and half (and vectors).
1018 :ref:`llvm.exp <int_exp>` Implemented for float and half (and vectors).
1020 :ref:`llvm.log10 <int_log10>` Implemented for float and half (and vectors).
1022 :ref:`llvm.exp2 <int_exp2>` Implemented for float and half (and vectors of float or
1023 half). Not implemented for double. Hardware provides
1024 1ULP accuracy for float, and 0.51ULP for half. Float
1025 instruction does not natively support denormal
1028 :ref:`llvm.stacksave.p5 <int_stacksave>` Implemented, must use the alloca address space.
1029 :ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space.
1031 :ref:`llvm.get.fpmode.i32 <int_get_fpmode>` The natural floating-point mode type is i32. This
is implemented by extracting relevant bits out of the MODE
1033 register with s_getreg_b32. The first 10 bits are the
1034 core floating-point mode. Bits 12:18 are the exception
1035 mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
1036 relevant to floating-point instructions are 0s.
1038 :ref:`llvm.get.rounding<int_get_rounding>` AMDGPU supports two separately controllable rounding
1039 modes depending on the floating-point type. One
1040 controls float, and the other controls both double and
1041 half operations. If both modes are the same, returns
1042 one of the standard return values. If the modes are
1043 different, returns one of :ref:`12 extended values
1044 <amdgpu-rounding-mode-enumeration-values-table>`
1045 describing the two modes.
1047 To nearest, ties away from zero is not a supported
1048 mode. The raw rounding mode values in the MODE
1049 register do not exactly match the FLT_ROUNDS values,
1050 so a conversion is performed.
1052 llvm.amdgcn.wave.reduce.umin Performs an arithmetic unsigned min reduction on the unsigned values
1053 provided by each lane in the wavefront.
1054 Intrinsic takes a hint for reduction strategy using second operand
1055 0: Target default preference,
1: `Iterative strategy`, and
2: `DPP strategy`.
If the target does not support DPP operations (e.g. gfx6/7), the
reduction will be performed using the default iterative strategy.
1060 Intrinsic is currently only implemented for i32.
1062 llvm.amdgcn.wave.reduce.umax Performs an arithmetic unsigned max reduction on the unsigned values
1063 provided by each lane in the wavefront.
1064 Intrinsic takes a hint for reduction strategy using second operand
1065 0: Target default preference,
1: `Iterative strategy`, and
2: `DPP strategy`.
If the target does not support DPP operations (e.g. gfx6/7), the
reduction will be performed using the default iterative strategy.
1070 Intrinsic is currently only implemented for i32.
1072 llvm.amdgcn.udot2 Provides direct access to v_dot2_u32_u16 across targets which
1073 support such instructions. This performs unsigned dot product
1074 with two v2i16 operands, summed with the third i32 operand. The
1075 i1 fourth operand is used to clamp the output.
1077 llvm.amdgcn.udot4 Provides direct access to v_dot4_u32_u8 across targets which
1078 support such instructions. This performs unsigned dot product
1079 with two i32 operands (holding a vector of 4 8bit values), summed
1080 with the third i32 operand. The i1 fourth operand is used to clamp
1083 llvm.amdgcn.udot8 Provides direct access to v_dot8_u32_u4 across targets which
1084 support such instructions. This performs unsigned dot product
1085 with two i32 operands (holding a vector of 8 4bit values), summed
1086 with the third i32 operand. The i1 fourth operand is used to clamp
1089 llvm.amdgcn.sdot2 Provides direct access to v_dot2_i32_i16 across targets which
1090 support such instructions. This performs signed dot product
1091 with two v2i16 operands, summed with the third i32 operand. The
1092 i1 fourth operand is used to clamp the output.
1093 When applicable (e.g. no clamping), this is lowered into
1094 v_dot2c_i32_i16 for targets which support it.
1096 llvm.amdgcn.sdot4 Provides direct access to v_dot4_i32_i8 across targets which
1097 support such instructions. This performs signed dot product
1098 with two i32 operands (holding a vector of 4 8bit values), summed
1099 with the third i32 operand. The i1 fourth operand is used to clamp
1101 When applicable (i.e. no clamping / operand modifiers), this is lowered
1102 into v_dot4c_i32_i8 for targets which support it.
1103 RDNA3 does not offer v_dot4_i32_i8, and rather offers
1104 v_dot4_i32_iu8 which has operands to hold the signedness of the
1105 vector operands. Thus, this intrinsic lowers to the signed version
1106 of this instruction for gfx11 targets.
llvm.amdgcn.sdot8 Provides direct access to v_dot8_i32_i4 across targets which
1109 support such instructions. This performs signed dot product
1110 with two i32 operands (holding a vector of 8 4bit values), summed
1111 with the third i32 operand. The i1 fourth operand is used to clamp
1113 When applicable (i.e. no clamping / operand modifiers), this is lowered
1114 into v_dot8c_i32_i4 for targets which support it.
1115 RDNA3 does not offer v_dot8_i32_i4, and rather offers
v_dot8_i32_iu4 which has operands to hold the signedness of the
1117 vector operands. Thus, this intrinsic lowers to the signed version
1118 of this instruction for gfx11 targets.
1120 llvm.amdgcn.sudot4 Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs
1121 dot product with two i32 operands (holding a vector of 4 8bit values), summed
1122 with the fifth i32 operand. The i1 sixth operand is used to clamp
1123 the output. The i1s preceding the vector operands decide the signedness.
1125 llvm.amdgcn.sudot8 Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs
1126 dot product with two i32 operands (holding a vector of 8 4bit values), summed
1127 with the fifth i32 operand. The i1 sixth operand is used to clamp
1128 the output. The i1s preceding the vector operands decide the signedness.
1130 llvm.amdgcn.sched_barrier Controls the types of instructions that may be allowed to cross the intrinsic
1131 during instruction scheduling. The parameter is a mask for the instruction types
1132 that can cross the intrinsic.
1134 - 0x0000: No instructions may be scheduled across sched_barrier.
1135 - 0x0001: All, non-memory, non-side-effect producing instructions may be
1136 scheduled across sched_barrier, *i.e.* allow ALU instructions to pass.
1137 - 0x0002: VALU instructions may be scheduled across sched_barrier.
1138 - 0x0004: SALU instructions may be scheduled across sched_barrier.
1139 - 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
1140 - 0x0010: All VMEM instructions may be scheduled across sched_barrier.
1141 - 0x0020: VMEM read instructions may be scheduled across sched_barrier.
1142 - 0x0040: VMEM write instructions may be scheduled across sched_barrier.
1143 - 0x0080: All DS instructions may be scheduled across sched_barrier.
- 0x0100: All DS read instructions may be scheduled across sched_barrier.
1145 - 0x0200: All DS write instructions may be scheduled across sched_barrier.
1146 - 0x0400: All Transcendental (e.g. V_EXP) instructions may be scheduled across sched_barrier.
1148 llvm.amdgcn.sched_group_barrier Creates schedule groups with specific properties to create custom scheduling
1149 pipelines. The ordering between groups is enforced by the instruction scheduler.
The intrinsic applies to the code that precedes the intrinsic. The intrinsic
1151 takes three values that control the behavior of the schedule groups.
1153 - Mask : Classify instruction groups using the llvm.amdgcn.sched_barrier mask values.
1154 - Size : The number of instructions that are in the group.
1155 - SyncID : Order is enforced between groups with matching values.
1157 The mask can include multiple instruction types. It is undefined behavior to set
1158 values beyond the range of valid masks.
1160 Combining multiple sched_group_barrier intrinsics enables an ordering of specific
1161 instruction types during instruction scheduling. For example, the following enforces
a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA instructions:
1165 | ``// 1 VMEM read``
1166 | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
| ``// 1 VALU``
| ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
| ``// 5 MFMA``
| ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
1172 llvm.amdgcn.iglp_opt An **experimental** intrinsic for instruction group level parallelism. The intrinsic
implements predefined instruction scheduling orderings. The intrinsic applies to the
1174 surrounding scheduling region. The intrinsic takes a value that specifies the
1175 strategy. The compiler implements two strategies.
1177 0. Interleave DS and MFMA instructions for small GEMM kernels.
1178 1. Interleave DS and MFMA instructions for single wave small GEMM kernels.
1180 Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic
1181 cannot be combined with sched_barrier or sched_group_barrier.
1183 The iglp_opt strategy implementations are subject to change.
1185 llvm.amdgcn.atomic.cond.sub.u32 Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32
1186 and ds_cond_sub_u32 based on address space on gfx12 targets. This
1187 performs subtraction only if the memory value is greater than or
1188 equal to the data value.
1190 llvm.amdgcn.s.getpc Provides access to the s_getpc_b64 instruction, but with the return value
1191 sign-extended from the width of the underlying PC hardware register even on
1192 processors where the s_getpc_b64 instruction returns a zero-extended value.
1194 ============================================== ==========================================================
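For example (a minimal sketch), target intrinsics such as ``llvm.amdgcn.sqrt``
are declared and called like any other overloaded LLVM intrinsic:

.. code-block:: llvm

  declare float @llvm.amdgcn.sqrt.f32(float)

  define float @sqrt_example(float %x) {
    %r = call float @llvm.amdgcn.sqrt.f32(float %x)
    ret float %r
  }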
.. TODO::

   List AMDGPU intrinsics.
1203 The AMDGPU backend supports the following LLVM IR attributes.
1205 .. table:: AMDGPU LLVM IR Attributes
1206 :name: amdgpu-llvm-ir-attributes-table
1208 ======================================= ==========================================================
1209 LLVM Attribute Description
1210 ======================================= ==========================================================
1211 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
1212 will be specified when the kernel is dispatched. Generated
1213 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
1214 The IR implied default value is 1,1024. Clang may emit this attribute
1215 with more restrictive bounds depending on language defaults.
1216 If the actual block or workgroup size exceeds the limit at any point during
1217 the execution, the behavior is undefined. For example, even if there is
1218 only one active thread but the thread local id exceeds the limit, the
1219 behavior is undefined.
1221 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
1222 argument block size for the implicit arguments. This
1223 varies by OS and language (for OpenCL see
1224 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
1225 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
1226 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
1227 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
1228 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
1229 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
1230 execution unit. Generated by the ``amdgpu_waves_per_eu``
1231 CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
1232 and the backend may not be able to satisfy the request. If
1233 the specified range is incompatible with the function's
1234 "amdgpu-flat-work-group-size" value, the implied occupancy
1235 bounds by the workgroup size takes precedence.
1237 "amdgpu-ieee" true/false. GFX6-GFX11 Only
1238 Specify whether the function expects the IEEE field of the
1239 mode register to be set on entry. Overrides the default for
1240 the calling convention.
1241 "amdgpu-dx10-clamp" true/false. GFX6-GFX11 Only
1242 Specify whether the function expects the DX10_CLAMP field of
1243 the mode register to be set on entry. Overrides the default
1244 for the calling convention.
1246 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
1247 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
1248 attribute, or reached through a call site marked with this attribute,
1249 the value returned by the intrinsic is undefined. The backend can
1250 generally infer this during code generation, so typically there is no
1251 benefit to frontends marking functions with this.
1253 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
1254 llvm.amdgcn.workitem.id.y intrinsic.
1256 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
1257 llvm.amdgcn.workitem.id.z intrinsic.
1259 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
1260 llvm.amdgcn.workgroup.id.x intrinsic.
1262 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
1263 llvm.amdgcn.workgroup.id.y intrinsic.
1265 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
1266 llvm.amdgcn.workgroup.id.z intrinsic.
1268 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
1269 llvm.amdgcn.dispatch.ptr intrinsic.
1271 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
1272 llvm.amdgcn.implicitarg.ptr intrinsic.
1274 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
1275 llvm.amdgcn.dispatch.id intrinsic.
1277 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
1278 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
1279 attributes, the queue pointer may be required in situations where the
1280 intrinsic call does not directly appear in the program. Some subtargets
require the queue pointer to handle some addrspacecasts, as well
as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
llvm.debugtrap intrinsics.
1285 "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1286 kernel argument that holds the pointer to the hostcall buffer. If this
1287 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1289 "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1290 kernel argument that holds the pointer to an initialized memory buffer
1291 that conforms to the requirements of the malloc/free device library V1
1292 version implementation. If this attribute is absent, then the
1293 amdgpu-no-implicitarg-ptr is also removed.
1295 "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1296 kernel argument that holds the multigrid synchronization pointer. If this
1297 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1299 "amdgpu-no-default-queue" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1300 kernel argument that holds the default queue pointer. If this
1301 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1303 "amdgpu-no-completion-action" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1304 kernel argument that holds the completion action pointer. If this
1305 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1307 "amdgpu-lds-size"="min[,max]" Min is the minimum number of bytes that will be allocated in the Local
1308 Data Store at address zero. Variables are allocated within this frame
1309 using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS
1310 pass. Optional max is the maximum number of bytes that will be allocated.
1311 Note that min==max indicates that no further variables can be added to
1312 the frame. This is an internal detail of how LDS variables are lowered,
1313 language front ends should not set this attribute.
1315 ======================================= ==========================================================
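For example (a sketch; the exact values depend on the source language and
options), these attributes appear as string function attributes in the IR:

.. code-block:: llvm

  define amdgpu_kernel void @attribute_example() #0 {
    ret void
  }

  attributes #0 = { "amdgpu-flat-work-group-size"="1,256"
                    "amdgpu-waves-per-eu"="2,4" }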
1320 The AMDGPU backend supports the following calling conventions:
1322 .. table:: AMDGPU Calling Conventions
1325 =============================== ==========================================================
1326 Calling Convention Description
1327 =============================== ==========================================================
1328 ``ccc`` The C calling convention. Used by default.
1329 See :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`
1332 ``fastcc`` The fast calling convention. Mostly the same as the ``ccc``.
1334 ``coldcc`` The cold calling convention. Mostly the same as the ``ccc``.
1336 ``amdgpu_cs`` Used for Mesa/AMDPAL compute shaders.
1340 ``amdgpu_cs_chain`` Similar to ``amdgpu_cs``, with differences described below.
1342 Functions with this calling convention cannot be called directly. They must
1343 instead be launched via the ``llvm.amdgcn.cs.chain`` intrinsic.
1345 Arguments are passed in SGPRs, starting at s0, if they have the ``inreg``
1346 attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs
1347 than available in the subtarget is not allowed. On subtargets that use
1348 a scratch buffer descriptor (as opposed to ``scratch_{load,store}_*`` instructions),
1349 the scratch buffer descriptor is passed in s[48:51]. This limits the
1350 SGPR / ``inreg`` arguments to the equivalent of 48 dwords; using more
1351 than that is not allowed.
1353 The return type must be void.
1354 Varargs, sret, byval, byref, inalloca, preallocated are not supported.
1356 Values in scalar registers as well as v0-v7 are not preserved. Values in
1357 VGPRs starting at v8 are not preserved for the active lanes, but must be
1358 saved by the callee for inactive lanes when using WWM.
1360 Wave scratch is "empty" at function boundaries. There is no stack pointer input
1361 or output value, but functions are free to use scratch starting from an initial
1362 stack pointer. Calls to ``amdgpu_gfx`` functions are allowed and behave like they
1363 do in ``amdgpu_cs`` functions.
1365 All counters (``lgkmcnt``, ``vmcnt``, ``storecnt``, etc.) are presumed in an
1366 unknown state at function entry.
1368 A function may have multiple exits (e.g. one chain exit and one plain ``ret void``
1369 for when the wave ends), but all ``llvm.amdgcn.cs.chain`` exits must be in
1370 uniform control flow.
1372 ``amdgpu_cs_chain_preserve`` Same as ``amdgpu_cs_chain``, but active lanes for VGPRs starting at v8 are preserved.
1373 Calls to ``amdgpu_gfx`` functions are not allowed, and any calls to ``llvm.amdgcn.cs.chain``
1374 must not pass more VGPR arguments than the caller's VGPR function parameters.
1376 ``amdgpu_es`` Used for AMDPAL shader stage before geometry shader if geometry is in
1377 use. So either the domain (= tessellation evaluation) shader if
1378 tessellation is in use, or otherwise the vertex shader.
1382 ``amdgpu_gfx`` Used for AMD graphics targets. Functions with this calling convention
1383 cannot be used as entry points.
1387 ``amdgpu_gs`` Used for Mesa/AMDPAL geometry shaders.
1391 ``amdgpu_hs`` Used for Mesa/AMDPAL hull shaders (= tessellation control shaders).
1395 ``amdgpu_kernel`` See :ref:`amdgpu-amdhsa-function-call-convention-kernel-functions`
1397 ``amdgpu_ls`` Used for AMDPAL vertex shader if tessellation is in use.
1401 ``amdgpu_ps`` Used for Mesa/AMDPAL pixel shaders.
1405 ``amdgpu_vs`` Used for Mesa/AMDPAL last shader stage before rasterization (vertex
1406 shader if tessellation and geometry are not in use, or otherwise
1407 copy shader if one is needed).
1411 =============================== ==========================================================
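For example (a minimal sketch), the calling convention is written before the
return type of an IR function definition or call:

.. code-block:: llvm

  define amdgpu_kernel void @kernel_entry(ptr addrspace(1) %out) {
    store i32 42, ptr addrspace(1) %out
    ret void
  }

  define amdgpu_gfx float @gfx_callee(float %x) {
    %y = fadd float %x, 1.0
    ret float %y
  }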
1414 .. _amdgpu-elf-code-object:
1419 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1420 can be linked by ``lld`` to produce a standard ELF shared code object which can
1421 be loaded and executed on an AMDGPU target.
1423 .. _amdgpu-elf-header:
1428 The AMDGPU backend uses the following ELF header:
1430 .. table:: AMDGPU ELF Header
1431 :name: amdgpu-elf-header-table
========================== ===============================
Field                      Value
========================== ===============================
1436 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
1437 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
1438 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
1439 - ``ELFOSABI_AMDGPU_HSA``
1440 - ``ELFOSABI_AMDGPU_PAL``
1441 - ``ELFOSABI_AMDGPU_MESA3D``
1442 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1443 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1444 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1445 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1446 - ``ELFABIVERSION_AMDGPU_PAL``
1447 - ``ELFABIVERSION_AMDGPU_MESA3D``
``e_type``                 - ``ET_REL``
                           - ``ET_DYN``
1450 ``e_machine`` ``EM_AMDGPU``
1452 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1453 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1454 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1455 ========================== ===============================
1459 .. table:: AMDGPU ELF Header Enumeration Values
1460 :name: amdgpu-elf-header-enumeration-values-table
=============================== =====
Name                            Value
=============================== =====
1467 ``ELFOSABI_AMDGPU_HSA`` 64
1468 ``ELFOSABI_AMDGPU_PAL`` 65
1469 ``ELFOSABI_AMDGPU_MESA3D`` 66
1470 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1471 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1472 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1473 ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1474 ``ELFABIVERSION_AMDGPU_PAL`` 0
1475 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1476 =============================== =====
1478 ``e_ident[EI_CLASS]``
1481 * ``ELFCLASS32`` for ``r600`` architecture.
1483 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1484 process address space applications.
1486 ``e_ident[EI_DATA]``
1487 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1489 ``e_ident[EI_OSABI]``
1490 One of the following AMDGPU target architecture specific OS ABIs
1491 (see :ref:`amdgpu-os`):
1493 * ``ELFOSABI_NONE`` for *unknown* OS.
1495 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1497 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1501 ``e_ident[EI_ABIVERSION]``
1502 The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1505 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1506 runtime ABI for code object V2. Can no longer be emitted by this version of LLVM.
1508 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1509 runtime ABI for code object V3. Can no longer be emitted by this version of LLVM.
1511 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1512 runtime ABI for code object V4. Specify using the Clang option
1513 ``-mcode-object-version=4``.
1515 * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1516 runtime ABI for code object V5. Specify using the Clang option
1517 ``-mcode-object-version=5``. This is the default code object
1518 version if not specified.
1520 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1523 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1527 Can be one of the following values:
1531 The type produced by the AMDGPU backend compiler as it is relocatable code
1535 The type produced by the linker as it is a shared code object.
The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1540 The value ``EM_AMDGPU`` is used for the machine for all processors supported
1541 by the ``r600`` and ``amdgcn`` architectures (see
1542 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1543 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1544 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1545 ``e_flags`` for code object V3 and above (see
1546 :ref:`amdgpu-elf-header-e_flags-table-v3` and
1547 :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1550 The entry point is 0 as the entry points for individual kernels must be
1551 selected in order to invoke them through AQL packets.
1554 The AMDGPU backend uses the following ELF header flags:
1556 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1557 :name: amdgpu-elf-header-e_flags-v2-table
1559 ===================================== ===== =============================
1560 Name Value Description
1561 ===================================== ===== =============================
1562 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
1564 enabled for all code
1565 contained in the code object.
1567 does not support the
1572 :ref:`amdgpu-target-features`.
1573 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
1574 handler is enabled for all
1575 code contained in the code
1576 object. If the processor
1577 does not support a trap
1578 handler then must be 0.
1580 :ref:`amdgpu-target-features`.
1581 ===================================== ===== =============================
1583 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1584 :name: amdgpu-elf-header-e_flags-table-v3
1586 ================================= ===== =============================
1587 Name Value Description
1588 ================================= ===== =============================
1589 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1591 ``EF_AMDGPU_MACH_xxx`` values
1593 :ref:`amdgpu-ef-amdgpu-mach-table`.
1594 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
1596 enabled for all code
1597 contained in the code object.
1599 does not support the
1604 :ref:`amdgpu-target-features`.
1605 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
1607 enabled for all code
1608 contained in the code object.
1610 does not support the
1615 :ref:`amdgpu-target-features`.
1616 ================================= ===== =============================
1618 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1619 :name: amdgpu-elf-header-e_flags-table-v4-onwards
1621 ============================================ ===== ===================================
1622 Name Value Description
1623 ============================================ ===== ===================================
1624 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1626 ``EF_AMDGPU_MACH_xxx`` values
1628 :ref:`amdgpu-ef-amdgpu-mach-table`.
1629 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1630 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1632 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1633 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1634 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1635 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1636 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1637 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1639 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1640 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4``         0x800 SRAMECC disabled.
1642 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1643 ============================================ ===== ===================================
1645 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1646 :name: amdgpu-ef-amdgpu-mach-table
1648 ==================================== ========== =============================
1649 Name Value Description (see
1650 :ref:`amdgpu-processor-table`)
1651 ==================================== ========== =============================
1652 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1653 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1654 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1655 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1656 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1657 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1658 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1659 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1660 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1661 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1662 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1663 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1664 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1665 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1666 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1667 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1668 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1669 *reserved* 0x011 - Reserved for ``r600``
1670 0x01f architecture processors.
1671 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1672 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1673 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1674 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1675 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1676 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1677 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1678 *reserved* 0x027 Reserved.
1679 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1680 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1681 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1682 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1683 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1684 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1685 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1686 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1687 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1688 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1689 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1690 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1691 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1692 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1693 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1694 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1695 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1696 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1697 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1698 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1699 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1700 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1701 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1702 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1703 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940``
1704 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100``
1705 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1706 ``EF_AMDGPU_MACH_AMDGCN_GFX1150`` 0x043 ``gfx1150``
1707 ``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103``
1708 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036``
1709 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101``
1710 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102``
1711 ``EF_AMDGPU_MACH_AMDGCN_GFX1200`` 0x048 ``gfx1200``
1712 *reserved* 0x049 Reserved.
1713 ``EF_AMDGPU_MACH_AMDGCN_GFX1151`` 0x04a ``gfx1151``
1714 ``EF_AMDGPU_MACH_AMDGCN_GFX941`` 0x04b ``gfx941``
1715 ``EF_AMDGPU_MACH_AMDGCN_GFX942`` 0x04c ``gfx942``
1716 *reserved* 0x04d Reserved.
1717 ``EF_AMDGPU_MACH_AMDGCN_GFX1201`` 0x04e ``gfx1201``
1718 ==================================== ========== =============================
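For illustration only, the following C sketch shows how a tool might decode the
processor and XNACK selection from the ``e_flags`` field of a code object V4 (or
later) ELF header. The constants mirror the values in the tables above; the
helper names are illustrative and not part of any AMDGPU API.

.. code:: c

   #include <stdint.h>
   #include <stdio.h>

   /* Bit fields and values from the e_flags tables above. */
   #define EF_AMDGPU_MACH                  0x0ffu
   #define EF_AMDGPU_FEATURE_XNACK_V4      0x300u
   #define EF_AMDGPU_FEATURE_XNACK_ANY_V4  0x100u
   #define EF_AMDGPU_FEATURE_XNACK_OFF_V4  0x200u
   #define EF_AMDGPU_FEATURE_XNACK_ON_V4   0x300u
   #define EF_AMDGPU_MACH_AMDGCN_GFX90A    0x03fu

   static const char *xnack_setting_v4(uint32_t e_flags) {
     switch (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) {
     case EF_AMDGPU_FEATURE_XNACK_ANY_V4: return "xnack any";
     case EF_AMDGPU_FEATURE_XNACK_OFF_V4: return "xnack-";
     case EF_AMDGPU_FEATURE_XNACK_ON_V4:  return "xnack+";
     default:                             return "xnack unsupported";
     }
   }

   int main(void) {
     /* Example: gfx90a with XNACK enabled. */
     uint32_t e_flags =
         EF_AMDGPU_MACH_AMDGCN_GFX90A | EF_AMDGPU_FEATURE_XNACK_ON_V4;
     printf("EF_AMDGPU_MACH=%#x %s\n", e_flags & EF_AMDGPU_MACH,
            xnack_setting_v4(e_flags));
     return 0;
   }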
1723 An AMDGPU target ELF code object has the standard ELF sections which include:
1725 .. table:: AMDGPU ELF Sections
1726 :name: amdgpu-elf-sections-table
1728 ================== ================ =================================
1729 Name Type Attributes
1730 ================== ================ =================================
1731 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1732 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1733 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1734 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1735 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1736 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1737 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1738 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1739 ``.note`` ``SHT_NOTE`` *none*
1740 ``.rela``\ *name* ``SHT_RELA`` *none*
1741 ``.rela.dyn`` ``SHT_RELA`` *none*
1742 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1743 ``.shstrtab`` ``SHT_STRTAB`` *none*
1744 ``.strtab`` ``SHT_STRTAB`` *none*
1745 ``.symtab`` ``SHT_SYMTAB`` *none*
1746 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1747 ================== ================ =================================
1749 These sections have their standard meanings (see [ELF]_) and are only generated
1753 The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1754 information on the DWARF produced by the AMDGPU backend.
1756 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1757 The standard sections used by a dynamic loader.
1760 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1763 ``.rela``\ *name*, ``.rela.dyn``
1764 For relocatable code objects, *name* is the name of the section that the
relocation records apply to. For example, ``.rela.text`` is the section name for
1766 relocation records associated with the ``.text`` section.
1768 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1769 records from each of the relocatable code object's ``.rela``\ *name* sections.
1771 See :ref:`amdgpu-relocation-records` for the relocation records supported by
1775 The executable machine code for the kernels and functions they call. Generated
1776 as position independent code. See :ref:`amdgpu-code-conventions` for
information on conventions used in the ISA generation.
1779 .. _amdgpu-note-records:
1784 The AMDGPU backend code object contains ELF note records in the ``.note``
1785 section. The set of generated notes and their semantics depend on the code
1786 object version; see :ref:`amdgpu-note-records-v2` and
1787 :ref:`amdgpu-note-records-v3-onwards`.
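As an illustration only, the following C sketch walks the records of a ``.note``
section using the standard ``Elf64_Nhdr`` layout and the 4-byte alignment rules
described below. The function name is made up, and the input is assumed to be
the raw contents of the ``.note`` section.

.. code:: c

   #include <elf.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Round up to the 4-byte alignment required for the name and desc
      fields of each note record. */
   static size_t align4(size_t x) { return (x + 3) & ~(size_t)3; }

   /* Illustrative only: print the vendor name and type of each note
      record in a .note section. */
   static void walk_notes(const uint8_t *sec, size_t size) {
     size_t offset = 0;
     while (offset + sizeof(Elf64_Nhdr) <= size) {
       const Elf64_Nhdr *n = (const Elf64_Nhdr *)(sec + offset);
       const char *name = (const char *)(n + 1);
       printf("vendor=%s type=%u descsz=%u\n", name, n->n_type, n->n_descsz);
       offset += sizeof(Elf64_Nhdr) + align4(n->n_namesz) + align4(n->n_descsz);
     }
   }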
1789 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1790 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1791 byte aligned. In addition, minimal zero-byte padding must be generated to
1792 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1793 field of the ``.note`` section must be at least 4 to indicate at least 8 byte
1796 .. _amdgpu-note-records-v2:
1798 Code Object V2 Note Records
1799 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1802 Code object V2 generation is no longer supported by this version of LLVM.
1804 The AMDGPU backend code object uses the following ELF note record in the
1805 ``.note`` section when compiling for code object V2.
1807 The note record vendor field is "AMD".
1809 Additional note records may be present, but any which are not documented here
1810 are deprecated and should not be used.
1812 .. table:: AMDGPU Code Object V2 ELF Note Records
1813 :name: amdgpu-elf-note-records-v2-table
1815 ===== ===================================== ======================================
1816 Name Type Description
1817 ===== ===================================== ======================================
1818 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
1819 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
1820 Finalizer and not the LLVM compiler.
1821 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
1822 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
1823 YAML [YAML]_ textual format.
1824 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
1825 ===== ===================================== ======================================
1829 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1830 :name: amdgpu-elf-note-record-enumeration-values-v2-table
1832 ===================================== =====
1834 ===================================== =====
1835 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
1836 ``NT_AMD_HSA_HSAIL`` 2
1837 ``NT_AMD_HSA_ISA_VERSION`` 3
1839 ``NT_AMD_HSA_METADATA`` 10
1840 ``NT_AMD_HSA_ISA_NAME`` 11
1841 ===================================== =====
1843 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1844 Specifies the code object version number. The description field has the
1849 struct amdgpu_hsa_note_code_object_version_s {
1850 uint32_t major_version;
1851 uint32_t minor_version;
1854 The ``major_version`` has a value less than or equal to 2.
1856 ``NT_AMD_HSA_HSAIL``
1857 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1858 field has the following layout:
1862 struct amdgpu_hsa_note_hsail_s {
1863 uint32_t hsail_major_version;
1864 uint32_t hsail_minor_version;
1866 uint8_t machine_model;
1867 uint8_t default_float_round;
1870 ``NT_AMD_HSA_ISA_VERSION``
1871 Specifies the target ISA version. The description field has the following layout:
1875 struct amdgpu_hsa_note_isa_s {
1876 uint16_t vendor_name_size;
1877 uint16_t architecture_name_size;
1881 char vendor_and_architecture_name[1];
``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
vendor and architecture names respectively, each including the NUL character.
``vendor_and_architecture_name`` contains the NUL terminated string for the
1888 vendor, immediately followed by the NUL terminated string for the
1891 This note record is used by the HSA runtime loader.
1893 Code object V2 only supports a limited number of processors and has fixed
1894 settings for target features. See
1895 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1896 processors and the corresponding target ID. In the table the note record ISA
1897 name is a concatenation of the vendor name, architecture name, major, minor,
1898 and stepping separated by a ":".
1900 The target ID column shows the processor name and fixed target features used
1901 by the LLVM compiler. The LLVM compiler does not generate a
1902 ``NT_AMD_HSA_HSAIL`` note record.
1904 A code object generated by the Finalizer also uses code object V2 and always
1905 generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
``sramecc`` target feature are as shown in
1907 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1908 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1911 ``NT_AMD_HSA_ISA_NAME``
1912 Specifies the target ISA name as a non-NUL terminated string.
1914 This note record is not used by the HSA runtime loader.
1916 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1917 V2's limited support of processors and fixed settings for target features.
1919 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1920 from the string to the corresponding target ID. If the ``xnack`` target
1921 feature is supported and enabled, the string produced by the LLVM compiler
may have a ``+xnack`` appended. The Finalizer did not do the appending and
1923 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1925 ``NT_AMD_HSA_METADATA``
1926 Specifies extensible metadata associated with the code objects executed on HSA
1927 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1928 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1929 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1932 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1933 :name: amdgpu-elf-note-record-supported_processors-v2-table
1935 ===================== ==========================
1936 Note Record ISA Name Target ID
1937 ===================== ==========================
1938 ``AMD:AMDGPU:6:0:0`` ``gfx600``
1939 ``AMD:AMDGPU:6:0:1`` ``gfx601``
1940 ``AMD:AMDGPU:6:0:2`` ``gfx602``
1941 ``AMD:AMDGPU:7:0:0`` ``gfx700``
1942 ``AMD:AMDGPU:7:0:1`` ``gfx701``
1943 ``AMD:AMDGPU:7:0:2`` ``gfx702``
1944 ``AMD:AMDGPU:7:0:3`` ``gfx703``
1945 ``AMD:AMDGPU:7:0:4`` ``gfx704``
1946 ``AMD:AMDGPU:7:0:5`` ``gfx705``
1947 ``AMD:AMDGPU:8:0:0`` ``gfx802``
1948 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
1949 ``AMD:AMDGPU:8:0:2`` ``gfx802``
1950 ``AMD:AMDGPU:8:0:3`` ``gfx803``
1951 ``AMD:AMDGPU:8:0:4`` ``gfx803``
1952 ``AMD:AMDGPU:8:0:5`` ``gfx805``
1953 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
1954 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
1955 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
1956 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
1957 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
1958 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
1959 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
1960 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
1961 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
1962 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1963 ===================== ==========================
1965 .. _amdgpu-note-records-v3-onwards:
1967 Code Object V3 and Above Note Records
1968 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1970 The AMDGPU backend code object uses the following ELF note record in the
1971 ``.note`` section when compiling for code object V3 and above.
1973 The note record vendor field is "AMDGPU".
1975 Additional note records may be present, but any which are not documented here
1976 are deprecated and should not be used.
1978 .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1979 :name: amdgpu-elf-note-records-table-v3-onwards
1981 ======== ============================== ======================================
1982 Name Type Description
1983 ======== ============================== ======================================
1984 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
1986 ======== ============================== ======================================
1990 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1991 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1993 ============================== =====
1995 ============================== =====
1997 ``NT_AMDGPU_METADATA`` 32
1998 ============================== =====
2000 ``NT_AMDGPU_METADATA``
2001 Specifies extensible metadata associated with an AMDGPU code object. It is
2002 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
2003 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2004 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2005 :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
2013 Symbols include the following:
2015 .. table:: AMDGPU ELF Symbols
2016 :name: amdgpu-elf-symbols-table
2018 ===================== ================== ================ ==================
2019 Name Type Section Description
2020 ===================== ================== ================ ==================
2021 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
2024 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
2025 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
2026 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
2027 ===================== ================== ================ ==================
2030 Global variables both used and defined by the compilation unit.
2032 If the symbol is defined in the compilation unit then it is allocated in the
appropriate section according to whether it has initialized data or is readonly.
2035 If the symbol is external then its section is ``STN_UNDEF`` and the loader
2036 will resolve relocations using the definition provided by another code object
2037 or explicitly defined by the runtime.
2039 If the symbol resides in local/group memory (LDS) then its section is the
2040 special processor specific section name ``SHN_AMDGPU_LDS``, and the
2041 ``st_value`` field describes alignment requirements as it does for common
2046 Add description of linked shared object symbols. Seems undefined symbols
2047 are marked as STT_NOTYPE.
2050 Every HSA kernel has an associated kernel descriptor. It is the address of the
2051 kernel descriptor that is used in the AQL dispatch packet used to invoke the
2052 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
2053 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
2056 Every HSA kernel also has a symbol for its machine code entry point.
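For example, when launching through the HSA runtime, the kernel descriptor
address is typically obtained from the loaded executable's *link-name*\ ``.kd``
symbol and stored in the dispatch packet. The following C sketch assumes the HSA
runtime API, an already frozen executable, and a kernel named ``foo``; error
handling is omitted and the helper name is illustrative.

.. code:: c

   #include <hsa/hsa.h>
   #include <stdint.h>

   /* Illustrative sketch: set the AQL dispatch packet's kernel_object to
      the kernel descriptor address, which is what is used to invoke the
      kernel (not the machine code entry point). */
   static void set_kernel_object(hsa_executable_t exec, hsa_agent_t agent,
                                 hsa_kernel_dispatch_packet_t *packet) {
     hsa_executable_symbol_t sym;
     uint64_t kernel_object = 0;
     /* The loader exposes the kernel descriptor under the ".kd" symbol. */
     hsa_executable_get_symbol_by_name(exec, "foo.kd", &agent, &sym);
     hsa_executable_symbol_get_info(
         sym, HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel_object);
     packet->kernel_object = kernel_object;
   }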
2058 .. _amdgpu-relocation-records:
2063 The AMDGPU backend generates ``Elf64_Rela`` relocation records for
2064 AMDHSA or ``Elf64_Rel`` relocation records for Mesa/AMDPAL. Supported
2065 relocatable fields are:
2068 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
2069 alignment. These values use the same byte order as other word values in the
2070 AMDGPU architecture.
2073 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
2074 alignment. These values use the same byte order as other word values in the
2075 AMDGPU architecture.
The following notations are used for specifying relocation calculations:
2080 Represents the addend used to compute the value of the relocatable field. If
2081 the addend field is smaller than 64 bits then it is zero-extended to 64 bits
2082 for use in the calculations below. (In practice this only affects ``_HI``
2083 relocation types on Mesa/AMDPAL, where the addend comes from the 32-bit field
2084 but the result of the calculation depends on the high part of the full 64-bit
2088 Represents the offset into the global offset table at which the relocation
2089 entry's symbol will reside during execution.
2092 Represents the address of the global offset table.
Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
2096 of the storage unit being relocated (computed using ``r_offset``).
2099 Represents the value of the symbol whose index resides in the relocation
2100 entry. Relocations not using this must specify a symbol index of
2104 Represents the base address of a loaded executable or shared object which is
2105 the difference between the ELF address and the actual load address.
2106 Relocations using this are only valid in executable or shared objects.
2108 The following relocation types are supported:
2110 .. table:: AMDGPU ELF Relocation Records
2111 :name: amdgpu-elf-relocation-records-table
2113 ========================== ======= ===== ========== ==============================
2114 Relocation Type Kind Value Field Calculation
2115 ========================== ======= ===== ========== ==============================
2116 ``R_AMDGPU_NONE`` 0 *none* *none*
2117 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
2119 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
2121 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
2123 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
2124 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
2125 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
2127 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
2128 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
2129 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
2130 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
2131 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
2133 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
2134 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
2135 ========================== ======= ===== ========== ==============================
2137 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
2138 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
2140 There is no current OS loader support for 32-bit programs and so
2141 ``R_AMDGPU_ABS32`` is not used.
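As an illustration of the calculations in
:ref:`amdgpu-elf-relocation-records-table`, the two halves of a 64-bit
PC-relative value produced for a ``R_AMDGPU_REL32_LO``/``R_AMDGPU_REL32_HI``
pair could be computed as in the following C sketch (a sketch only, not the
actual ``lld`` implementation):

.. code:: c

   #include <stdint.h>

   /* R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF */
   static uint32_t rel32_lo(uint64_t S, uint64_t A, uint64_t P) {
     return (uint32_t)((S + A - P) & 0xFFFFFFFFu);
   }

   /* R_AMDGPU_REL32_HI: (S + A - P) >> 32 */
   static uint32_t rel32_hi(uint64_t S, uint64_t A, uint64_t P) {
     return (uint32_t)((S + A - P) >> 32);
   }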
2143 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
2145 Loaded Code Object Path Uniform Resource Identifier (URI)
2146 ---------------------------------------------------------
2148 The AMD GPU code object loader represents the path of the ELF shared object from
2149 which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory loaded and relocated form of the ELF
2151 shared object. Multiple code objects may be loaded at different memory
2152 addresses in the same process from the same ELF shared object.
2154 The loaded code object path URI syntax is defined by the following BNF syntax:
2158 code_object_uri ::== file_uri | memory_uri
2159 file_uri ::== "file://" file_path [ range_specifier ]
2160 memory_uri ::== "memory://" process_id range_specifier
2161 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
2162 file_path ::== URI_ENCODED_OS_FILE_PATH
2163 process_id ::== DECIMAL_NUMBER
2164 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
2167 Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
2168 and octal values by "0".
2171 Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
2172 every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
encoded as two uppercase hexadecimal digits preceded by "%". Directories in
2174 the path are separated by "/".
2177 Is a 0-based byte offset to the start of the code object. For a file URI, it
2178 is from the start of the file specified by the ``file_path``, and if omitted
2179 defaults to 0. For a memory URI, it is the memory address and is required.
2182 Is the number of bytes in the code object. For a file URI, if omitted it
2183 defaults to the size of the file. It is required for a memory URI.
2186 Is the identity of the process owning the memory. For Linux it is the C
2187 unsigned integral decimal literal for the process ID (PID).
2193 file:///dir1/dir2/file1
2194 file:///dir3/dir4/file2#offset=0x2000&size=3000
2195 memory://1234#offset=0x20000&size=3000
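For illustration, a loader could format such URIs as in the following C sketch;
the helper name is made up, and a real implementation would also URI-encode the
file path as described above.

.. code:: c

   #include <inttypes.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Sketch only: print a memory URI and a file URI with a range
      specifier, following the BNF above. */
   static void print_code_object_uris(long pid, uint64_t load_address,
                                      uint64_t size_in_bytes,
                                      const char *encoded_file_path,
                                      uint64_t file_offset) {
     printf("memory://%ld#offset=0x%" PRIx64 "&size=%" PRIu64 "\n",
            pid, load_address, size_in_bytes);
     printf("file://%s#offset=0x%" PRIx64 "&size=%" PRIu64 "\n",
            encoded_file_path, file_offset, size_in_bytes);
   }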
2197 .. _amdgpu-dwarf-debug-information:
2199 DWARF Debug Information
2200 =======================
2204 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
2205 is not currently fully implemented and is subject to change.
2207 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
2208 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
2209 object executable code and data to the source language constructs. It can be
2210 used by tools such as debuggers and profilers. It uses features defined in
2211 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
2212 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
2214 This section defines the AMDGPU target architecture specific DWARF mappings.
2216 .. _amdgpu-dwarf-register-identifier:
2221 This section defines the AMDGPU target architecture register numbers used in
2222 DWARF operation expressions (see DWARF Version 5 section 2.5 and
2223 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
2224 instructions (see DWARF Version 5 section 6.4 and
2225 :ref:`amdgpu-dwarf-call-frame-information`).
2227 A single code object can contain code for kernels that have different wavefront
2228 sizes. The vector registers and some scalar registers are based on the wavefront
2229 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
2230 simplifies the consumer of the DWARF so that each register has a fixed size,
2231 rather than being dynamic according to the wavefront size mode. Similarly,
2232 distinct DWARF registers are defined for those registers that vary in size
2233 according to the process address size. This allows a consumer to treat a
2234 specific AMDGPU processor as a single architecture regardless of how it is
2235 configured at run time. The compiler explicitly specifies the DWARF registers
2236 that match the mode in which the code it is generating will be executed.
2238 DWARF registers are encoded as numbers, which are mapped to architecture
2239 registers. The mapping for AMDGPU is defined in
2240 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
2243 .. table:: AMDGPU DWARF Register Mapping
2244 :name: amdgpu-dwarf-register-mapping-table
2246 ============== ================= ======== ==================================
2247 DWARF Register AMDGPU Register Bit Size Description
2248 ============== ================= ======== ==================================
2249 0 PC_32 32 Program Counter (PC) when
2250 executing in a 32-bit process
2251 address space. Used in the CFI to
2252 describe the PC of the calling
2254 1 EXEC_MASK_32 32 Execution Mask Register when
2255 executing in wavefront 32 mode.
2256 2-15 *Reserved* *Reserved for highly accessed
2257 registers using DWARF shortcut.*
2258 16 PC_64 64 Program Counter (PC) when
2259 executing in a 64-bit process
2260 address space. Used in the CFI to
2261 describe the PC of the calling
2263 17 EXEC_MASK_64 64 Execution Mask Register when
2264 executing in wavefront 64 mode.
2265 18-31 *Reserved* *Reserved for highly accessed
2266 registers using DWARF shortcut.*
2267 32-95 SGPR0-SGPR63 32 Scalar General Purpose
2269 96-127 *Reserved* *Reserved for frequently accessed
2270 registers using DWARF 1-byte ULEB.*
2271 128 STATUS 32 Status Register.
2272 129-511 *Reserved* *Reserved for future Scalar
2273 Architectural Registers.*
2274 512 VCC_32 32 Vector Condition Code Register
2275 when executing in wavefront 32
2277 513-767 *Reserved* *Reserved for future Vector
2278 Architectural Registers when
2279 executing in wavefront 32 mode.*
2280 768 VCC_64 64 Vector Condition Code Register
2281 when executing in wavefront 64
2283 769-1023 *Reserved* *Reserved for future Vector
2284 Architectural Registers when
2285 executing in wavefront 64 mode.*
2286 1024-1087 *Reserved* *Reserved for padding.*
2287 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers.
2288 1130-1535 *Reserved* *Reserved for future Scalar
2289 General Purpose Registers.*
2290 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers
2291 when executing in wavefront 32
2293 1792-2047 *Reserved* *Reserved for future Vector
2294 General Purpose Registers when
2295 executing in wavefront 32 mode.*
2296 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers
2297 when executing in wavefront 32
2299 2304-2559 *Reserved* *Reserved for future Vector
2300 Accumulation Registers when
2301 executing in wavefront 32 mode.*
2302 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers
2303 when executing in wavefront 64
2305 2816-3071 *Reserved* *Reserved for future Vector
2306 General Purpose Registers when
2307 executing in wavefront 64 mode.*
2308 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers
2309 when executing in wavefront 64
2311 3328-3583 *Reserved* *Reserved for future Vector
2312 Accumulation Registers when
2313 executing in wavefront 64 mode.*
2314 ============== ================= ======== ==================================
2316 The vector registers are represented as the full size for the wavefront. They
2317 are organized as consecutive dwords (32-bits), one per lane, with the dword at
2318 the least significant bit position corresponding to lane 0 and so forth. DWARF
2319 location expressions involving the ``DW_OP_LLVM_offset`` and
2320 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
2321 register corresponding to the lane that is executing the current thread of
2322 execution in languages that are implemented using a SIMD or SIMT execution
2325 If the wavefront size is 32 lanes then the wavefront 32 mode register
2326 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
2327 mode register definitions are used. Some AMDGPU targets support executing in
2328 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
2329 to the wavefront mode of the generated code will be used.
2331 If code is generated to execute in a 32-bit process address space, then the
2332 32-bit process address space register definitions are used. If code is generated
2333 to execute in a 64-bit process address space, then the 64-bit process address
2334 space register definitions are used. The ``amdgcn`` target only supports the
2335 64-bit process address space.
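For illustration, the general purpose register ranges in
:ref:`amdgpu-dwarf-register-mapping-table` can be computed as in the following C
sketch (the helper names are illustrative only):

.. code:: c

   /* SGPR0-SGPR63 map to DWARF registers 32-95 and SGPR64-SGPR105 map to
      1088-1129, per the mapping table above. */
   static unsigned dwarf_sgpr(unsigned n) {
     return n < 64 ? 32 + n : 1088 + (n - 64);
   }

   /* VGPR0-VGPR255 map to 1536-1791 in wavefront 32 mode and to
      2560-2815 in wavefront 64 mode. */
   static unsigned dwarf_vgpr(unsigned n, unsigned wavefront_size) {
     return (wavefront_size == 32 ? 1536u : 2560u) + n;
   }

   /* AGPR0-AGPR255 map to 2048-2303 in wavefront 32 mode and to
      3072-3327 in wavefront 64 mode. */
   static unsigned dwarf_agpr(unsigned n, unsigned wavefront_size) {
     return (wavefront_size == 32 ? 2048u : 3072u) + n;
   }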
2337 .. _amdgpu-dwarf-memory-space-identifier:
2339 Memory Space Identifier
2340 -----------------------
2342 The DWARF memory space represents the source language memory space. See DWARF
2343 Version 5 section 2.12 which is updated by the *DWARF Extensions For
2344 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
2346 The DWARF memory space mapping used for AMDGPU is defined in
2347 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
2349 .. table:: AMDGPU DWARF Memory Space Mapping
2350 :name: amdgpu-dwarf-memory-space-mapping-table
2352 =========================== ====== =================
2354 ---------------------------------- -----------------
2355 Memory Space Name Value Memory Space
2356 =========================== ====== =================
2357 ``DW_MSPACE_LLVM_none`` 0x0000 Generic (Flat)
2358 ``DW_MSPACE_LLVM_global`` 0x0001 Global
2359 ``DW_MSPACE_LLVM_constant`` 0x0002 Global
2360 ``DW_MSPACE_LLVM_group`` 0x0003 Local (group/LDS)
2361 ``DW_MSPACE_LLVM_private`` 0x0004 Private (Scratch)
2362 ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
2363 =========================== ====== =================
2365 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
2366 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. It is
available for the AMD extension that provides access to the hardware GDS memory,
which is scratchpad memory allocated per device.
2372 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
2373 default memory space of ``DW_MSPACE_LLVM_none`` is used.
2375 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
2376 mapping of DWARF memory spaces to DWARF address spaces, including address size
2379 .. _amdgpu-dwarf-address-space-identifier:
2381 Address Space Identifier
2382 ------------------------
2384 DWARF address spaces correspond to target architecture specific linear
2385 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
2386 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
2388 The DWARF address space mapping used for AMDGPU is defined in
2389 :ref:`amdgpu-dwarf-address-space-mapping-table`.
2391 .. table:: AMDGPU DWARF Address Space Mapping
2392 :name: amdgpu-dwarf-address-space-mapping-table
2394 ======================================= ===== ======= ======== ===================== =======================
2396 --------------------------------------- ----- ---------------- --------------------- -----------------------
2397 Address Space Name Value Address Bit Size LLVM IR Address Space
2398 --------------------------------------- ----- ------- -------- --------------------- -----------------------
2403 ======================================= ===== ======= ======== ===================== =======================
2404 ``DW_ASPACE_LLVM_none`` 0x00 64 32 Global *default address space*
2405 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
2406 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
2407 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
2409 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
2410 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
2411 ======================================= ===== ======= ======== ===================== =======================
2413 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
2414 spaces including address size and NULL value.
2416 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
2417 address space used in DWARF operations that do not specify an address space. It
2418 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
2419 related operations can refer to addresses in the program code.
2421 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
2422 specify the flat address space. If the address corresponds to an address in the
2423 local address space, then it corresponds to the wavefront that is executing the
2424 focused thread of execution. If the address corresponds to an address in the
2425 private address space, then it corresponds to the lane that is executing the
2426 focused thread of execution for languages that are implemented using a SIMD or
2427 SIMT execution model.
2431 CUDA-like languages such as HIP that do not have address spaces in the
2432 language type system, but do allow variables to be allocated in different
2433 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2434 address space in the DWARF expression operations as the default address space
2435 is the global address space.
2437 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2438 specify the local address space corresponding to the wavefront that is executing
2439 the focused thread of execution.
2441 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2442 to specify the private address space corresponding to the lane that is executing
2443 the focused thread of execution for languages that are implemented using a SIMD
2444 or SIMT execution model.
2446 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2447 to specify the unswizzled private address space corresponding to the wavefront
2448 that is executing the focused thread of execution. The wavefront view of private
2449 memory is the per wavefront unswizzled backing memory layout defined in
2450 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2451 location for the backing memory of the wavefront (namely the address is not
2452 offset by ``wavefront-scratch-base``). The following formula can be used to
2453 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2454 ``DW_ASPACE_AMDGPU_private_wave`` address:
2458 private-address-wavefront =
2459 ((private-address-lane / 4) * wavefront-size * 4) +
2460 (wavefront-lane-id * 4) + (private-address-lane % 4)
2462 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
2463 of the dwords for each lane starting with lane 0 is required, then this
2468 private-address-wavefront =
2469 private-address-lane * wavefront-size
2471 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2472 complete spilled vector register back into a complete vector register in the
2473 CFI. The frame pointer can be a private lane address which is dword aligned,
2474 which can be shifted to multiply by the wavefront size, and then used to form a
2475 private wavefront address that gives a location for a contiguous set of dwords,
2476 one per lane, where the vector register dwords are spilled. The compiler knows
2477 the wavefront size since it generates the code. Note that the type of the
2478 address may have to be converted as the size of a
2479 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2480 ``DW_ASPACE_AMDGPU_private_wave`` address.
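The conversions above can be written directly in C. The following is a sketch
only; the lane address, lane id, and wavefront size are supplied by the caller.

.. code:: c

   #include <stdint.h>

   /* General form: convert a DW_ASPACE_AMDGPU_private_lane address to the
      corresponding DW_ASPACE_AMDGPU_private_wave address for one lane. */
   static uint64_t private_lane_to_wave(uint32_t lane_address,
                                        uint32_t wavefront_lane_id,
                                        uint32_t wavefront_size) {
     return ((uint64_t)(lane_address / 4) * wavefront_size * 4) +
            ((uint64_t)wavefront_lane_id * 4) + (lane_address % 4);
   }

   /* Simplified form for a dword aligned lane address when the start of
      the dwords for each lane, beginning with lane 0, is required. */
   static uint64_t private_lane_to_wave_dword_start(uint32_t lane_address,
                                                    uint32_t wavefront_size) {
     return (uint64_t)lane_address * wavefront_size;
   }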
2482 .. _amdgpu-dwarf-lane-identifier:
DWARF lane identifiers specify a target architecture lane position for hardware
that executes in a SIMD or SIMT manner and onto whose lanes a source language
maps its threads of execution. The DWARF lane identifier is pushed by
2490 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2491 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2492 section :ref:`amdgpu-dwarf-operation-expressions`.
2494 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2495 wavefront. It is numbered from 0 to the wavefront size minus 1.
2497 Operation Expressions
2498 ---------------------
2500 DWARF expressions are used to compute program values and the locations of
2501 program objects. See DWARF Version 5 section 2.5 and
2502 :ref:`amdgpu-dwarf-operation-expressions`.
2504 DWARF location descriptions describe how to access storage which includes memory
2505 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2506 significant bytes first, and bits are ordered within bytes with least
2507 significant bits first.
2509 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2510 unwinding vector registers that are spilled under the execution mask to memory:
2511 the zero-single location description is the vector register, and the one-single
2512 location description is the spilled memory location description. The
2513 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2514 memory location description.
2516 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2517 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2518 controlled by the execution mask. An undefined location description together
2519 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2520 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2522 Debugger Information Entry Attributes
2523 -------------------------------------
2525 This section describes how certain debugger information entry attributes are
2526 used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
2527 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2528 :ref:`amdgpu-dwarf-low-level-information` and
2529 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2531 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2533 ``DW_AT_LLVM_lane_pc``
2534 ~~~~~~~~~~~~~~~~~~~~~~
2536 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2537 location of the separate lanes of a SIMT thread.
2539 If the lane is an active lane then this will be the same as the current program
2542 If the lane is inactive, but was active on entry to the subprogram, then this is
2543 the program location in the subprogram at which execution of the lane is
conceptually positioned.
2546 If the lane was not active on entry to the subprogram, then this will be the
2547 undefined location. A client debugger can check if the lane is part of a valid
2548 work-group by checking that the lane is in the range of the associated
2549 work-group within the grid, accounting for partial work-groups. If it is not,
2550 then the debugger can omit any information for the lane. Otherwise, the debugger
2551 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2552 calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
2554 ``DW_AT_LLVM_lane_pc``.
2556 The following example illustrates how the AMDGPU backend can generate a DWARF
2557 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2558 following subprogram pseudo code for a target with 64 lanes per wavefront.
2580 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2581 execution mask (``EXEC``) to linearize the control flow. The condition is
2582 evaluated to make a mask of the lanes for which the condition evaluates to true.
2583 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2584 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and logically
``AND``'ing it with the ``EXEC`` mask saved at the start of the region. After the ``IF/THEN/ELSE``
2587 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2588 region. This is shown below. Other approaches are possible, but the basic
2589 concept is the same.
2622 To create the DWARF location list expression that defines the location
2623 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2624 pseudo instruction can be used to annotate the linearized control flow. This can
2625 be done by defining an artificial variable for the lane PC. The DWARF location
2626 list expression created for it is used as the value of the
2627 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2629 A DWARF procedure is defined for each well nested structured control flow region
2630 which provides the conceptual lane program location for a lane if it is not
2631 active (namely it is divergent). The DWARF operation expression for each region
2632 conceptually inherits the value of the immediately enclosing region and modifies
2633 it according to the semantics of the region.
2635 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2636 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2637 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2638 region since the ``THEN`` region has completed.
2640 The lane PC artificial variable is assigned at each region transition. It uses
2641 the immediately enclosing region's DWARF procedure to compute the program
2642 location for each lane assuming they are divergent, and then modifies the result
2643 by inserting the current program location for each lane that the ``EXEC`` mask
2644 indicates is active.
2646 By having separate DWARF procedures for each region, they can be reused to
2647 define the value for any nested region. This reduces the total size of the DWARF
2648 operation expressions.
2650 The following provides an example using pseudo LLVM MIR.
2656 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2657 DW_AT_name = "__uint64";
2658 DW_AT_byte_size = 8;
2659 DW_AT_encoding = DW_ATE_unsigned;
2661 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2662 DW_AT_name = "__active_lane_pc";
2665 DW_OP_LLVM_extend 64, 64;
DW_OP_regval_type EXEC, %__uint_64;
2667 DW_OP_LLVM_select_bit_piece 64, 64;
2670 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2671 DW_AT_name = "__divergent_lane_pc";
2673 DW_OP_LLVM_undefined;
2674 DW_OP_LLVM_extend 64, 64;
2677 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2678 DW_OP_call_ref %__divergent_lane_pc;
2679 DW_OP_call_ref %__active_lane_pc;
2683 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2688 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2689 DW_AT_name = "__divergent_lane_pc_1_then";
2690 DW_AT_location = DIExpression[
2691 DW_OP_call_ref %__divergent_lane_pc;
2692 DW_OP_addrx &lex_1_start;
2694 DW_OP_LLVM_extend 64, 64;
2695 DW_OP_call_ref %__lex_1_save_exec;
2696 DW_OP_deref_type 64, %__uint_64;
2697 DW_OP_LLVM_select_bit_piece 64, 64;
2700 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2701 DW_OP_call_ref %__divergent_lane_pc_1_then;
2702 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2711 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2712 DW_AT_name = "__divergent_lane_pc_1_1_then";
2713 DW_AT_location = DIExpression[
2714 DW_OP_call_ref %__divergent_lane_pc_1_then;
2715 DW_OP_addrx &lex_1_1_start;
2717 DW_OP_LLVM_extend 64, 64;
2718 DW_OP_call_ref %__lex_1_1_save_exec;
2719 DW_OP_deref_type 64, %__uint_64;
2720 DW_OP_LLVM_select_bit_piece 64, 64;
2723 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2724 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2725 DW_OP_call_ref %__active_lane_pc;
2730 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2731 DW_AT_name = "__divergent_lane_pc_1_1_else";
2732 DW_AT_location = DIExpression[
2733 DW_OP_call_ref %__divergent_lane_pc_1_then;
2734 DW_OP_addrx &lex_1_1_end;
2736 DW_OP_LLVM_extend 64, 64;
2737 DW_OP_call_ref %__lex_1_1_save_exec;
2738 DW_OP_deref_type 64, %__uint_64;
2739 DW_OP_LLVM_select_bit_piece 64, 64;
2742 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2743 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2744 DW_OP_call_ref %__active_lane_pc;
2749 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2750 DW_OP_call_ref %__divergent_lane_pc;
2751 DW_OP_call_ref %__active_lane_pc;
2756 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2757 DW_AT_name = "__divergent_lane_pc_1_else";
2758 DW_AT_location = DIExpression[
2759 DW_OP_call_ref %__divergent_lane_pc;
2760 DW_OP_addrx &lex_1_end;
2762 DW_OP_LLVM_extend 64, 64;
2763 DW_OP_call_ref %__lex_1_save_exec;
2764 DW_OP_deref_type 64, %__uint_64;
2765 DW_OP_LLVM_select_bit_piece 64, 64;
2768 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2769 DW_OP_call_ref %__divergent_lane_pc_1_else;
2770 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2776 DW_OP_call_ref %__divergent_lane_pc;
2777 DW_OP_call_ref %__active_lane_pc;
2782 The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
2783 that are active, with the current program location.
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are created for
2786 the execution masks saved on entry to a region. Using the ``DBG_VALUE`` pseudo
2787 instruction, location list entries will be created that describe where the
2788 artificial variables are allocated at any given program location. The compiler
2789 may allocate them to registers or spill them to memory.
2791 The DWARF procedures for each region use the values of the saved execution mask
2792 artificial variables to only update the lanes that are active on entry to the
2793 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will have
2795 the undefined location description.
2797 Other structured control flow regions can be handled similarly. For example,
2798 loops would set the divergent program location for the region at the end of the
2799 loop. Any lanes active will be in the loop, and any lanes not active must have
2802 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2803 ``IF/THEN/ELSE`` regions.
2805 The DWARF procedures can use the active lane artificial variable described in
2806 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2807 ``EXEC`` mask in order to support whole or quad wavefront mode.
2809 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2811 ``DW_AT_LLVM_active_lane``
2812 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2814 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2815 entry is used to specify the lanes that are conceptually active for a SIMT
2818 The execution mask may be modified to implement whole or quad wavefront mode
2819 operations. For example, all lanes may need to temporarily be made active to
2820 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2821 update it to enable the necessary lanes, perform the operations, and then
2822 restore the ``EXEC`` mask from the saved value. While executing the whole
2823 wavefront region, the conceptual execution mask is the saved value, not the
2826 This is handled by defining an artificial variable for the active lane mask. The
2827 active lane mask artificial variable would be the actual ``EXEC`` mask for
2828 normal regions, and the saved execution mask for regions where the mask is
2829 temporarily updated. The location list expression created for this artificial
2830 variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2833 ``DW_AT_LLVM_augmentation``
2834 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2836 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2837 debugger information entry has the following value for the augmentation string:
2843 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2844 extensions used in the DWARF of the compilation unit. The version number
2845 conforms to [SEMVER]_.
2847 Call Frame Information
2848 ----------------------
2850 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2851 *unwind* call frames in a running process or core dump. See DWARF Version 5
2852 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2854 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2856 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
2862 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
2863 extensions used in this CIE or to the FDEs that use it. The version number
2864 conforms to [SEMVER]_.
2866 2. ``address_size`` for the ``Global`` address space is defined in
2867 :ref:`amdgpu-dwarf-address-space-identifier`.
2869 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2871 4. ``code_alignment_factor`` is 4 bytes.
2875 Add to :ref:`amdgpu-processor-table` table.
2877 5. ``data_alignment_factor`` is 4 bytes.
2881 Add to :ref:`amdgpu-processor-table` table.
2883 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2884 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2886 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
2887 called from subprogram Y that has more allocated, X will not change any of
2888 the extra registers as it cannot access them. Therefore, the default rule
2889 for all columns is ``same value``.
2891 For AMDGPU the register number follows the numbering defined in
2892 :ref:`amdgpu-dwarf-register-identifier`.
2894 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2895 the return address to get the address of a byte within the call site
2896 instructions. See DWARF Version 5 section 6.4.4.
2901 See DWARF Version 5 section 6.1.
2903 Lookup By Name Section Header
2904 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2906 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2908 For AMDGPU the lookup by name section header table:
2910 ``augmentation_string_size`` (uword)
2912 Set to the length of the ``augmentation_string`` value which is always a
2915 ``augmentation_string`` (sequence of UTF-8 characters)
2917 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2923 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2924 extensions used in the DWARF of this index. The version number conforms to
This is different from the DWARF Version 5 definition that requires the first
2930 4 characters to be the vendor ID. But this is consistent with the other
2931 augmentation strings and does allow multiple vendor contributions. However,
2932 backwards compatibility may be more desirable.
2934 Lookup By Address Section Header
2935 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2937 See DWARF Version 5 section 6.1.2.
2939 For AMDGPU the lookup by address section header table:
2941 ``address_size`` (ubyte)
2943 Match the address size for the ``Global`` address space defined in
2944 :ref:`amdgpu-dwarf-address-space-identifier`.
2946 ``segment_selector_size`` (ubyte)
2948 AMDGPU does not use a segment selector so this is 0. The entries in the
2949 ``.debug_aranges`` do not have a segment selector.
2951 Line Number Information
2952 -----------------------
2954 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2957 The instruction set must be obtained from the ELF file header ``e_flags`` field
2958 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2959 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
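As an illustrative sketch (not part of the DWARF specification), a consumer
could recover the processor from the ELF header as follows; the function name
is hypothetical and ``EF_AMDGPU_MACH`` is the mask defined in
``llvm/BinaryFormat/ELF.h``:

.. code-block:: c++

  #include "llvm/BinaryFormat/ELF.h"
  #include <cstdint>

  // Returns the EF_AMDGPU_MACH_* processor value encoded in e_flags.
  unsigned getAMDGPUMach(uint32_t EFlags) {
    return EFlags & llvm::ELF::EF_AMDGPU_MACH;
  }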
2963 Should the ``isa`` state machine register be used to indicate if the code is
2964 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2966 For AMDGPU the line number program header fields have the following values (see
2967 DWARF Version 5 section 6.2.4):
2969 ``address_size`` (ubyte)
2970 Matches the address size for the ``Global`` address space defined in
2971 :ref:`amdgpu-dwarf-address-space-identifier`.
2973 ``segment_selector_size`` (ubyte)
2974 AMDGPU does not use a segment selector so this is 0.
2976 ``minimum_instruction_length`` (ubyte)
2977 For GFX9-GFX11 this is 4.
2979 ``maximum_operations_per_instruction`` (ubyte)
2980 For GFX9-GFX11 this is 1.
2982 Source text for online-compiled programs (for example, those compiled by the
2983 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2984 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2985 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2986 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2988 The Clang option used to control source embedding in AMDGPU is defined in
2989 :ref:`amdgpu-clang-debug-options-table`.
2991 .. table:: AMDGPU Clang Debug Options
2992 :name: amdgpu-clang-debug-options-table
2994 ==================== ==================================================
2995 Debug Flag Description
2996 ==================== ==================================================
2997 -g[no-]embed-source Enable/disable embedding source text in DWARF
2998 debug sections. Useful for environments where
2999 source cannot be written to disk, such as
3000 when performing online compilation.
3001 ==================== ==================================================
3006 Enable the embedded source.
3008 ``-gno-embed-source``
3009 Disable the embedded source.
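For example, an online compiler might enable source embedding with a command
along these lines (the file name is illustrative; ``-gembed-source`` requires
DWARF Version 5 debug information):

.. code-block:: console

  $ clang -x cl -target amdgcn-amd-amdhsa -mcpu=gfx900 \
      -gdwarf-5 -gembed-source -c kernel.cl -o kernel.o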
3011 32-Bit and 64-Bit DWARF Formats
3012 -------------------------------
3014 See DWARF Version 5 section 7.4 and
3015 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
3019 * For the ``amdgcn`` target architecture only the 64-bit process address space
3022 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
3023 the 32-bit DWARF format.
3028 For AMDGPU the following values apply for each of the unit headers described in
3029 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
3031 ``address_size`` (ubyte)
3032 Matches the address size for the ``Global`` address space defined in
3033 :ref:`amdgpu-dwarf-address-space-identifier`.
3035 .. _amdgpu-code-conventions:
3040 This section provides code conventions used for each supported target triple OS
3041 (see :ref:`amdgpu-target-triples`).
3046 This section provides code conventions used when the target triple OS is
3047 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
3049 .. _amdgpu-amdhsa-code-object-metadata:
3051 Code Object Metadata
3052 ~~~~~~~~~~~~~~~~~~~~
3054 The code object metadata specifies extensible metadata associated with the code
3055 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
3056 encoding and semantics of this metadata depends on the code object version; see
3057 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
3058 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
3059 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
3060 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
3062 Code object metadata is specified in a note record (see
3063 :ref:`amdgpu-note-records`) and is required when the target triple OS is
3064 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
3065 information necessary to support the HSA compatible runtime kernel queries. For
3066 example, the segment sizes needed in a dispatch packet. In addition, a
3067 high-level language runtime may require other information to be included. For
3068 example, the AMD OpenCL runtime records kernel argument information.
3070 .. _amdgpu-amdhsa-code-object-metadata-v2:
3072 Code Object V2 Metadata
3073 +++++++++++++++++++++++
3076 Code object V2 generation is no longer supported by this version of LLVM.
3078 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
3079 (see :ref:`amdgpu-note-records-v2`).
3081 The metadata is specified as a YAML formatted string (see [YAML]_ and
Is the string null terminated? It probably should not be if YAML allows it
to contain null characters; otherwise, it should be.
The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.
3093 For boolean values, the string values of ``false`` and ``true`` are used for
3094 false and true respectively.
3096 Additional information can be added to the mappings. To avoid conflicts, any
3097 non-AMD key names should be prefixed by "*vendor-name*.".
3099 .. table:: AMDHSA Code Object V2 Metadata Map
3100 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
3102 ========== ============== ========= =======================================
3103 String Key Value Type Required? Description
3104 ========== ============== ========= =======================================
3105 "Version" sequence of Required - The first integer is the major
3106 2 integers version. Currently 1.
3107 - The second integer is the minor
3108 version. Currently 0.
3109 "Printf" sequence of Each string is encoded information
3110 strings about a printf function call. The
3111 encoded information is organized as
3112 fields separated by colon (':'):
3114 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3119 A 32-bit integer as a unique id for
3120 each printf function call
3123 A 32-bit integer equal to the number
3124 of arguments of printf function call
3127 ``S[i]`` (where i = 0, 1, ... , N-1)
3128 32-bit integers for the size in bytes
3129 of the i-th FormatString argument of
3130 the printf function call
3133 The format string passed to the
3134 printf function call.
3135 "Kernels" sequence of Required Sequence of the mappings for each
3136 mapping kernel in the code object. See
3137 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
3138 for the definition of the mapping.
3139 ========== ============== ========= =======================================
3143 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
3144 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
3146 ================= ============== ========= ================================
3147 String Key Value Type Required? Description
3148 ================= ============== ========= ================================
3149 "Name" string Required Source name of the kernel.
3150 "SymbolName" string Required Name of the kernel
3151 descriptor ELF symbol.
3152 "Language" string Source language of the kernel.
3160 "LanguageVersion" sequence of - The first integer is the major
3162 - The second integer is the
3164 "Attrs" mapping Mapping of kernel attributes.
3166 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
3167 for the mapping definition.
3168 "Args" sequence of Sequence of mappings of the
3169 mapping kernel arguments. See
3170 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
3171 for the definition of the mapping.
3172 "CodeProps" mapping Mapping of properties related to
3173 the kernel code. See
3174 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
3175 for the mapping definition.
3176 ================= ============== ========= ================================
3180 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
3181 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
3183 =================== ============== ========= ==============================
3184 String Key Value Type Required? Description
3185 =================== ============== ========= ==============================
3186 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
3187 3 integers must be >=1 and the dispatch
3188 work-group size X, Y, Z must
3189 correspond to the specified
3190 values. Defaults to 0, 0, 0.
3192 Corresponds to the OpenCL
3193 ``reqd_work_group_size``
3195 "WorkGroupSizeHint" sequence of The dispatch work-group size
3196 3 integers X, Y, Z is likely to be the
3199 Corresponds to the OpenCL
3200 ``work_group_size_hint``
3202 "VecTypeHint" string The name of a scalar or vector
3205 Corresponds to the OpenCL
3206 ``vec_type_hint`` attribute.
3208 "RuntimeHandle" string The external symbol name
3209 associated with a kernel.
3210 OpenCL runtime allocates a
3211 global buffer for the symbol
3212 and saves the kernel's address
3213 to it, which is used for
3214 device side enqueueing. Only
3215 available for device side
3217 =================== ============== ========= ==============================
3221 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
3222 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
3224 ================= ============== ========= ================================
3225 String Key Value Type Required? Description
3226 ================= ============== ========= ================================
3227 "Name" string Kernel argument name.
3228 "TypeName" string Kernel argument type name.
3229 "Size" integer Required Kernel argument size in bytes.
3230 "Align" integer Required Kernel argument alignment in
3231 bytes. Must be a power of two.
3232 "ValueKind" string Required Kernel argument kind that
3233 specifies how to set up the
3234 corresponding argument.
3238 The argument is copied
3239 directly into the kernarg.
3242 A global address space pointer
3243 to the buffer data is passed
3246 "DynamicSharedPointer"
3247 A group address space pointer
3248 to dynamically allocated LDS
3249 is passed in the kernarg.
3252 A global address space
3253 pointer to a S# is passed in
3257 A global address space
3258 pointer to a T# is passed in
3262 A global address space pointer
3263 to an OpenCL pipe is passed in
3267 A global address space pointer
3268 to an OpenCL device enqueue
3269 queue is passed in the
3272 "HiddenGlobalOffsetX"
3273 The OpenCL grid dispatch
3274 global offset for the X
3275 dimension is passed in the
3278 "HiddenGlobalOffsetY"
3279 The OpenCL grid dispatch
3280 global offset for the Y
3281 dimension is passed in the
3284 "HiddenGlobalOffsetZ"
3285 The OpenCL grid dispatch
3286 global offset for the Z
3287 dimension is passed in the
3291 An argument that is not used
3292 by the kernel. Space needs to
3293 be left for it, but it does
3294 not need to be set up.
3296 "HiddenPrintfBuffer"
3297 A global address space pointer
3298 to the runtime printf buffer
3299 is passed in kernarg. Mutually
3301 "HiddenHostcallBuffer".
3303 "HiddenHostcallBuffer"
3304 A global address space pointer
3305 to the runtime hostcall buffer
3306 is passed in kernarg. Mutually
3308 "HiddenPrintfBuffer".
3310 "HiddenDefaultQueue"
3311 A global address space pointer
3312 to the OpenCL device enqueue
3313 queue that should be used by
3314 the kernel by default is
3315 passed in the kernarg.
3317 "HiddenCompletionAction"
3318 A global address space pointer
3319 to help link enqueued kernels into
3320 the ancestor tree for determining
3321 when the parent kernel has finished.
3323 "HiddenMultiGridSyncArg"
3324 A global address space pointer for
3325 multi-grid synchronization is
3326 passed in the kernarg.
3328 "ValueType" string Unused and deprecated. This should no longer
3329 be emitted, but is accepted for compatibility.
3332 "PointeeAlign" integer Alignment in bytes of pointee
3333 type for pointer type kernel
3334 argument. Must be a power
3335 of 2. Only present if
3337 "DynamicSharedPointer".
3338 "AddrSpaceQual" string Kernel argument address space
3339 qualifier. Only present if
3340 "ValueKind" is "GlobalBuffer" or
3341 "DynamicSharedPointer". Values
Is GlobalBuffer only Global
or Constant? Is
DynamicSharedPointer always
Local? Can HCC allow Generic?
How can Private or Region
ever be used?
3360 "AccQual" string Kernel argument access
3361 qualifier. Only present if
3362 "ValueKind" is "Image" or
3375 "ActualAccQual" string The actual memory accesses
3376 performed by the kernel on the
3377 kernel argument. Only present if
3378 "ValueKind" is "GlobalBuffer",
3379 "Image", or "Pipe". This may be
3380 more restrictive than indicated
3381 by "AccQual" to reflect what the
kernel actually does. If not
3383 present then the runtime must
3384 assume what is implied by
3385 "AccQual" and "IsConst". Values
3392 "IsConst" boolean Indicates if the kernel argument
3393 is const qualified. Only present
3397 "IsRestrict" boolean Indicates if the kernel argument
3398 is restrict qualified. Only
3399 present if "ValueKind" is
3402 "IsVolatile" boolean Indicates if the kernel argument
3403 is volatile qualified. Only
3404 present if "ValueKind" is
3407 "IsPipe" boolean Indicates if the kernel argument
3408 is pipe qualified. Only present
3409 if "ValueKind" is "Pipe".
3413 Can GlobalBuffer be pipe
3416 ================= ============== ========= ================================
3420 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
3421 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
3423 ============================ ============== ========= =====================
3424 String Key Value Type Required? Description
3425 ============================ ============== ========= =====================
3426 "KernargSegmentSize" integer Required The size in bytes of
3428 that holds the values
3431 "GroupSegmentFixedSize" integer Required The amount of group
3435 bytes. This does not
3437 dynamically allocated
3438 group segment memory
3442 "PrivateSegmentFixedSize" integer Required The amount of fixed
3443 private address space
3444 memory required for a
3446 bytes. If the kernel
3448 stack then additional
3450 to this value for the
3452 "KernargSegmentAlign" integer Required The maximum byte
3455 kernarg segment. Must
3457 "WavefrontSize" integer Required Wavefront size. Must
3459 "NumSGPRs" integer Required Number of scalar
3463 includes the special
3465 Scratch (GFX7-GFX10)
3467 GFX8-GFX10). It does
3469 SGPR added if a trap
3475 "NumVGPRs" integer Required Number of vector
3479 "MaxFlatWorkGroupSize" integer Required Maximum flat
3482 kernel in work-items.
3485 ReqdWorkGroupSize if
3487 "NumSpilledSGPRs" integer Number of stores from
3488 a scalar register to
3489 a register allocator
3492 "NumSpilledVGPRs" integer Number of stores from
3493 a vector register to
3494 a register allocator
3497 ============================ ============== ========= =====================
3499 .. _amdgpu-amdhsa-code-object-metadata-v3:
3501 Code Object V3 Metadata
3502 +++++++++++++++++++++++
3505 Code object V3 generation is no longer supported by this version of LLVM.
3507 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3508 record (see :ref:`amdgpu-note-records-v3-onwards`).
3510 The metadata is represented as Message Pack formatted binary data (see
3511 [MsgPack]_). The top level is a Message Pack map that includes the
3512 keys defined in table
:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced tables.
3516 Additional information can be added to the maps. To avoid conflicts,
3517 any key names should be prefixed by "*vendor-name*." where
3518 ``vendor-name`` can be the name of the vendor and specific vendor
3519 tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the same
*vendor-name*.
3523 .. table:: AMDHSA Code Object V3 Metadata Map
3524 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3526 ================= ============== ========= =======================================
3527 String Key Value Type Required? Description
3528 ================= ============== ========= =======================================
3529 "amdhsa.version" sequence of Required - The first integer is the major
3530 2 integers version. Currently 1.
3531 - The second integer is the minor
3532 version. Currently 0.
3533 "amdhsa.printf" sequence of Each string is encoded information
3534 strings about a printf function call. The
3535 encoded information is organized as
3536 fields separated by colon (':'):
3538 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3543 A 32-bit integer as a unique id for
3544 each printf function call
3547 A 32-bit integer equal to the number
3548 of arguments of printf function call
3551 ``S[i]`` (where i = 0, 1, ... , N-1)
3552 32-bit integers for the size in bytes
3553 of the i-th FormatString argument of
3554 the printf function call
3557 The format string passed to the
3558 printf function call.
3559 "amdhsa.kernels" sequence of Required Sequence of the maps for each
3560 map kernel in the code object. See
3561 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3562 for the definition of the keys included
3564 ================= ============== ========= =======================================
3568 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3569 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3571 =================================== ============== ========= ================================
3572 String Key Value Type Required? Description
3573 =================================== ============== ========= ================================
3574 ".name" string Required Source name of the kernel.
3575 ".symbol" string Required Name of the kernel
3576 descriptor ELF symbol.
3577 ".language" string Source language of the kernel.
3587 ".language_version" sequence of - The first integer is the major
3589 - The second integer is the
3591 ".args" sequence of Sequence of maps of the
3592 map kernel arguments. See
3593 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3594 for the definition of the keys
3595 included in that map.
3596 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3597 3 integers must be >=1 and the dispatch
3598 work-group size X, Y, Z must
3599 correspond to the specified
3600 values. Defaults to 0, 0, 0.
3602 Corresponds to the OpenCL
3603 ``reqd_work_group_size``
3605 ".workgroup_size_hint" sequence of The dispatch work-group size
3606 3 integers X, Y, Z is likely to be the
3609 Corresponds to the OpenCL
3610 ``work_group_size_hint``
3612 ".vec_type_hint" string The name of a scalar or vector
3615 Corresponds to the OpenCL
3616 ``vec_type_hint`` attribute.
3618 ".device_enqueue_symbol" string The external symbol name
3619 associated with a kernel.
3620 OpenCL runtime allocates a
3621 global buffer for the symbol
3622 and saves the kernel's address
3623 to it, which is used for
3624 device side enqueueing. Only
3625 available for device side
3627 ".kernarg_segment_size" integer Required The size in bytes of
3629 that holds the values
3632 ".group_segment_fixed_size" integer Required The amount of group
3636 bytes. This does not
3638 dynamically allocated
3639 group segment memory
3643 ".private_segment_fixed_size" integer Required The amount of fixed
3644 private address space
3645 memory required for a
3647 bytes. If the kernel
3649 stack then additional
3651 to this value for the
3653 ".kernarg_segment_align" integer Required The maximum byte
3656 kernarg segment. Must
3658 ".wavefront_size" integer Required Wavefront size. Must
3660 ".sgpr_count" integer Required Number of scalar
3661 registers required by a
3663 GFX6-GFX9. A register
3664 is required if it is
3666 if a higher numbered
3669 includes the special
3675 SGPR added if a trap
3681 ".vgpr_count" integer Required Number of vector
3682 registers required by
3684 GFX6-GFX9. A register
3685 is required if it is
3687 if a higher numbered
3690 ".agpr_count" integer Required Number of accumulator
3691 registers required by
3694 ".max_flat_workgroup_size" integer Required Maximum flat
3697 kernel in work-items.
3700 ReqdWorkGroupSize if
3702 ".sgpr_spill_count" integer Number of stores from
3703 a scalar register to
3704 a register allocator
3707 ".vgpr_spill_count" integer Number of stores from
3708 a vector register to
3709 a register allocator
3712 ".kind" string The kind of the kernel
3720 These kernels must be
3721 invoked after loading
3731 These kernels must be
3734 containing code object
3735 and after all init and
3736 normal kernels in the
3737 same code object have
3741 If omitted, "normal" is
3743 =================================== ============== ========= ================================
3747 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3748 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3750 ====================== ============== ========= ================================
3751 String Key Value Type Required? Description
3752 ====================== ============== ========= ================================
3753 ".name" string Kernel argument name.
3754 ".type_name" string Kernel argument type name.
3755 ".size" integer Required Kernel argument size in bytes.
3756 ".offset" integer Required Kernel argument offset in
3757 bytes. The offset must be a
3758 multiple of the alignment
3759 required by the argument.
3760 ".value_kind" string Required Kernel argument kind that
3761 specifies how to set up the
3762 corresponding argument.
3766 The argument is copied
3767 directly into the kernarg.
3770 A global address space pointer
3771 to the buffer data is passed
3774 "dynamic_shared_pointer"
3775 A group address space pointer
3776 to dynamically allocated LDS
3777 is passed in the kernarg.
3780 A global address space
3781 pointer to a S# is passed in
3785 A global address space
3786 pointer to a T# is passed in
3790 A global address space pointer
3791 to an OpenCL pipe is passed in
3795 A global address space pointer
3796 to an OpenCL device enqueue
3797 queue is passed in the
3800 "hidden_global_offset_x"
3801 The OpenCL grid dispatch
3802 global offset for the X
3803 dimension is passed in the
3806 "hidden_global_offset_y"
3807 The OpenCL grid dispatch
3808 global offset for the Y
3809 dimension is passed in the
3812 "hidden_global_offset_z"
3813 The OpenCL grid dispatch
3814 global offset for the Z
3815 dimension is passed in the
3819 An argument that is not used
3820 by the kernel. Space needs to
3821 be left for it, but it does
3822 not need to be set up.
3824 "hidden_printf_buffer"
3825 A global address space pointer
3826 to the runtime printf buffer
3827 is passed in kernarg. Mutually
3829 "hidden_hostcall_buffer"
3830 before Code Object V5.
3832 "hidden_hostcall_buffer"
3833 A global address space pointer
3834 to the runtime hostcall buffer
3835 is passed in kernarg. Mutually
3837 "hidden_printf_buffer"
3838 before Code Object V5.
3840 "hidden_default_queue"
3841 A global address space pointer
3842 to the OpenCL device enqueue
3843 queue that should be used by
3844 the kernel by default is
3845 passed in the kernarg.
3847 "hidden_completion_action"
3848 A global address space pointer
3849 to help link enqueued kernels into
3850 the ancestor tree for determining
3851 when the parent kernel has finished.
3853 "hidden_multigrid_sync_arg"
3854 A global address space pointer for
3855 multi-grid synchronization is
3856 passed in the kernarg.
3858 ".value_type" string Unused and deprecated. This should no longer
3859 be emitted, but is accepted for compatibility.
3861 ".pointee_align" integer Alignment in bytes of pointee
3862 type for pointer type kernel
3863 argument. Must be a power
3864 of 2. Only present if
3866 "dynamic_shared_pointer".
3867 ".address_space" string Kernel argument address space
3868 qualifier. Only present if
3869 ".value_kind" is "global_buffer" or
3870 "dynamic_shared_pointer". Values
3882 Is "global_buffer" only "global"
3884 "dynamic_shared_pointer" always
3885 "local"? Can HCC allow "generic"?
3886 How can "private" or "region"
3889 ".access" string Kernel argument access
3890 qualifier. Only present if
3891 ".value_kind" is "image" or
3904 ".actual_access" string The actual memory accesses
3905 performed by the kernel on the
3906 kernel argument. Only present if
3907 ".value_kind" is "global_buffer",
3908 "image", or "pipe". This may be
3909 more restrictive than indicated
3910 by ".access" to reflect what the
kernel actually does. If not
present then the runtime must
assume what is implied by
".access" and ".is_const". Values
3921 ".is_const" boolean Indicates if the kernel argument
3922 is const qualified. Only present
3926 ".is_restrict" boolean Indicates if the kernel argument
3927 is restrict qualified. Only
3928 present if ".value_kind" is
3931 ".is_volatile" boolean Indicates if the kernel argument
3932 is volatile qualified. Only
3933 present if ".value_kind" is
3936 ".is_pipe" boolean Indicates if the kernel argument
3937 is pipe qualified. Only present
3938 if ".value_kind" is "pipe".
3942 Can "global_buffer" be pipe
3945 ====================== ============== ========= ================================
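The following is a small, hypothetical example of code object V3 metadata,
shown in YAML form for readability (the binary encoding is MessagePack). The
kernel name, symbol, and resource counts are illustrative only:

.. code-block:: yaml

  amdhsa.version:
    - 1
    - 0
  amdhsa.kernels:
    - .name:                       add
      .symbol:                     add.kd
      .kernarg_segment_size:       16
      .kernarg_segment_align:      8
      .group_segment_fixed_size:   0
      .private_segment_fixed_size: 0
      .wavefront_size:             64
      .sgpr_count:                 8
      .vgpr_count:                 4
      .max_flat_workgroup_size:    1024
      .args:
        - .value_kind:    global_buffer
          .address_space: global
          .size:          8
          .offset:        0
        - .value_kind:    by_value
          .size:          4
          .offset:        8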
3947 .. _amdgpu-amdhsa-code-object-metadata-v4:
3949 Code Object V4 Metadata
3950 +++++++++++++++++++++++
Code object V4 is not the default code object version emitted by this version
of LLVM.
3956 Code object V4 metadata is the same as
3957 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3958 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3960 .. table:: AMDHSA Code Object V4 Metadata Map Changes
3961 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3963 ================= ============== ========= =======================================
3964 String Key Value Type Required? Description
3965 ================= ============== ========= =======================================
3966 "amdhsa.version" sequence of Required - The first integer is the major
3967 2 integers version. Currently 1.
3968 - The second integer is the minor
3969 version. Currently 1.
3970 "amdhsa.target" string Required The target name of the code using the syntax:
3974 <target-triple> [ "-" <target-id> ]
3976 A canonical target ID must be
3977 used. See :ref:`amdgpu-target-triples`
3978 and :ref:`amdgpu-target-id`.
3979 ================= ============== ========= =======================================
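For example, a hypothetical ``amdhsa.target`` value for a ``gfx90a`` code
object compiled with XNACK enabled would be::

  amdgcn-amd-amdhsa--gfx90a:xnack+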
3981 .. _amdgpu-amdhsa-code-object-metadata-v5:
3983 Code Object V5 Metadata
3984 +++++++++++++++++++++++
3986 Code object V5 metadata is the same as
3987 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3988 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
3989 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
3990 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3992 .. table:: AMDHSA Code Object V5 Metadata Map Changes
3993 :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3995 ================= ============== ========= =======================================
3996 String Key Value Type Required? Description
3997 ================= ============== ========= =======================================
3998 "amdhsa.version" sequence of Required - The first integer is the major
3999 2 integers version. Currently 1.
4000 - The second integer is the minor
4001 version. Currently 2.
4002 ================= ============== ========= =======================================
4006 .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
4007 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
4009 ============================= ============= ========== =======================================
4010 String Key Value Type Required? Description
4011 ============================= ============= ========== =======================================
4012 ".uses_dynamic_stack" boolean Indicates if the generated machine code
4013 is using a dynamically sized stack.
4014 ".workgroup_processor_mode" boolean (GFX10+) Controls ENABLE_WGP_MODE in
4015 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4016 ============================= ============= ========== =======================================
4020 .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
4021 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
4023 =========================== ============== ========= ==============================
4024 String Key Value Type Required? Description
4025 =========================== ============== ========= ==============================
4026 ".uniform_work_group_size" integer Indicates if the kernel
4027 requires that each dimension
4028 of global size is a multiple
4029 of corresponding dimension of
4030 work-group size. Value of 1
4031 implies true and value of 0
4032 implies false. Metadata is
4033 only emitted when value is 1.
4034 =========================== ============== ========= ==============================
4040 .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
4041 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
4043 ====================== ============== ========= ================================
4044 String Key Value Type Required? Description
4045 ====================== ============== ========= ================================
4046 ".value_kind" string Required Kernel argument kind that
4047 specifies how to set up the
4048 corresponding argument.
4050 the same as code object V3 metadata
4051 (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
4052 with the following additions:
4054 "hidden_block_count_x"
4055 The grid dispatch work-group count for the X dimension
4056 is passed in the kernarg. Some languages, such as OpenCL,
4057 support a last work-group in each dimension being partial.
4058 This count only includes the non-partial work-group count.
4059 This is not the same as the value in the AQL dispatch packet,
4060 which has the grid size in work-items.
4062 "hidden_block_count_y"
4063 The grid dispatch work-group count for the Y dimension
4064 is passed in the kernarg. Some languages, such as OpenCL,
4065 support a last work-group in each dimension being partial.
4066 This count only includes the non-partial work-group count.
4067 This is not the same as the value in the AQL dispatch packet,
4068 which has the grid size in work-items. If the grid dimensionality
4069 is 1, then must be 1.
4071 "hidden_block_count_z"
4072 The grid dispatch work-group count for the Z dimension
4073 is passed in the kernarg. Some languages, such as OpenCL,
4074 support a last work-group in each dimension being partial.
4075 This count only includes the non-partial work-group count.
4076 This is not the same as the value in the AQL dispatch packet,
4077 which has the grid size in work-items. If the grid dimensionality
4078 is 1 or 2, then must be 1.
4080 "hidden_group_size_x"
4081 The grid dispatch work-group size for the X dimension is
4082 passed in the kernarg. This size only applies to the
4083 non-partial work-groups. This is the same value as the AQL
4084 dispatch packet work-group size.
4086 "hidden_group_size_y"
4087 The grid dispatch work-group size for the Y dimension is
4088 passed in the kernarg. This size only applies to the
4089 non-partial work-groups. This is the same value as the AQL
4090 dispatch packet work-group size. If the grid dimensionality
4091 is 1, then must be 1.
4093 "hidden_group_size_z"
4094 The grid dispatch work-group size for the Z dimension is
4095 passed in the kernarg. This size only applies to the
4096 non-partial work-groups. This is the same value as the AQL
4097 dispatch packet work-group size. If the grid dimensionality
4098 is 1 or 2, then must be 1.
4100 "hidden_remainder_x"
4101 The grid dispatch work group size of the partial work group
4102 of the X dimension, if it exists. Must be zero if a partial
4103 work group does not exist in the X dimension.
4105 "hidden_remainder_y"
4106 The grid dispatch work group size of the partial work group
4107 of the Y dimension, if it exists. Must be zero if a partial
4108 work group does not exist in the Y dimension.
4110 "hidden_remainder_z"
4111 The grid dispatch work group size of the partial work group
4112 of the Z dimension, if it exists. Must be zero if a partial
4113 work group does not exist in the Z dimension.
"hidden_grid_dims"
The grid dispatch dimensionality. This is the same value
as the AQL dispatch packet dimensionality. Must be a value
between 1 and 3.

"hidden_heap_v1"
A global address space pointer to an initialized memory
4122 buffer that conforms to the requirements of the malloc/free
4123 device library V1 version implementation.
4125 "hidden_dynamic_lds_size"
4126 Size of the dynamically allocated LDS memory is passed in the kernarg.
4128 "hidden_private_base"
4129 The high 32 bits of the flat addressing private aperture base.
4130 Only used by GFX8 to allow conversion between private segment
4131 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4133 "hidden_shared_base"
4134 The high 32 bits of the flat addressing shared aperture base.
4135 Only used by GFX8 to allow conversion between shared segment
4136 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
"hidden_queue_ptr"
A global memory address space pointer to the ROCm runtime
4140 ``struct amd_queue_t`` structure for the HSA queue of the
4141 associated dispatch AQL packet. It is only required for pre-GFX9
4142 devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
4144 ====================== ============== ========= ================================
4151 The HSA architected queuing language (AQL) defines a user space memory interface
4152 that can be used to control the dispatch of kernels, in an agent independent
4153 way. An agent can have zero or more AQL queues created for it using an HSA
4154 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
4155 are 64 bytes) can be placed. See the *HSA Platform System Architecture
4156 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
4158 The packet processor of a kernel agent is responsible for detecting and
4159 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
4160 packet processor is implemented by the hardware command processor (CP),
4161 asynchronous dispatch controller (ADC) and shader processor input controller
4164 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
4165 the kernel mode driver to initialize and register the AQL queue with CP.
To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU. A simplified code
sketch of steps 5 and 6 follows the list.
4170 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
4171 executed is obtained.
4172 2. A pointer to the kernel descriptor (see
4173 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
4174 It must be for a kernel that is contained in a code object that was loaded
by an HSA compatible runtime on the kernel agent with which the AQL queue is
associated.
4177 3. Space is allocated for the kernel arguments using the HSA compatible runtime
4178 allocator for a memory region with the kernarg property for the kernel agent
4179 that will execute the kernel. It must be at least 16-byte aligned.
4180 4. Kernel argument values are assigned to the kernel argument memory
4181 allocation. The layout is defined in the *HSA Programmer's Language
4182 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
4183 kernel argument memory in the same way constant memory is accessed. (Note
4184 that the HSA specification allows an implementation to copy the kernel
4185 argument contents to another location that is accessed by the kernel.)
4186 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
runtime API uses 64-bit atomic operations to reserve space in the AQL queue
4188 for the packet. The packet must be set up, and the final write must use an
4189 atomic store release to set the packet kind to ensure the packet contents are
4190 visible to the kernel agent. AQL defines a doorbell signal mechanism to
4191 notify the kernel agent that the AQL queue has been updated. These rules, and
the layout of the AQL queue and kernel dispatch packet, are defined in the *HSA
4193 System Architecture Specification* [HSA]_.
4194 6. A kernel dispatch packet includes information about the actual dispatch,
4195 such as grid and work-group size, together with information from the code
4196 object about the kernel, such as segment sizes. The HSA compatible runtime
4197 queries on the kernel symbol can be used to obtain the code object values
4198 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
4199 7. CP executes micro-code and is responsible for detecting and setting up the
4200 GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
4202 code, the scalar general purpose registers (SGPR) and vector general purpose
4203 registers (VGPR) are set up as required by the machine code. The required
4204 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
4205 register state is defined in
4206 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4207 9. The prolog of the kernel machine code (see
4208 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
4209 before continuing executing the machine code that corresponds to the kernel.
4210 10. When the kernel dispatch has completed execution, CP signals the completion
4211 signal specified in the kernel dispatch packet if not 0.
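The following sketch illustrates steps 5 and 6 above using the HSA runtime
queue and packet types. It is illustrative only: queue creation, kernarg
allocation, signal setup, and error handling are omitted, and the grid and
work-group sizes are placeholders.

.. code-block:: c++

  #include <hsa/hsa.h>
  #include <cstdint>
  #include <cstring>

  // Illustrative only: 'queue', 'kernel_object', 'kernarg', and 'completion'
  // are assumed to have been obtained via the HSA runtime as described above.
  void dispatch(hsa_queue_t *queue, uint64_t kernel_object, void *kernarg,
                hsa_signal_t completion) {
    // Step 5: reserve a packet slot with a 64-bit atomic add on the write index.
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *pkt =
        reinterpret_cast<hsa_kernel_dispatch_packet_t *>(queue->base_address) +
        (index % queue->size);

    // Step 6: dispatch information such as grid/work-group sizes, together with
    // values queried from the code object metadata (segment sizes, etc.).
    std::memset(pkt, 0, sizeof(*pkt));
    pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS; // 1-D grid
    pkt->workgroup_size_x = 256;
    pkt->workgroup_size_y = pkt->workgroup_size_z = 1;
    pkt->grid_size_x = 1024;
    pkt->grid_size_y = pkt->grid_size_z = 1;
    pkt->kernel_object = kernel_object;   // kernel descriptor address (step 2)
    pkt->kernarg_address = kernarg;       // kernarg allocation (steps 3 and 4)
    pkt->completion_signal = completion;  // signaled on completion (step 10)

    // Final write: release-store the header so the packet becomes visible to
    // the packet processor, then ring the doorbell.
    uint16_t header =
        (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
    __atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);
    hsa_signal_store_screlease(queue->doorbell_signal, index);
  }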
4213 .. _amdgpu-amdhsa-memory-spaces:
4218 The memory space properties are:
4220 .. table:: AMDHSA Memory Spaces
4221 :name: amdgpu-amdhsa-memory-spaces-table
4223 ================= =========== ======== ======= ==================
4224 Memory Space Name HSA Segment Hardware Address NULL Value
4226 ================= =========== ======== ======= ==================
4227 Private private scratch 32 0x00000000
4228 Local group LDS 32 0xFFFFFFFF
4229 Global global global 64 0x0000000000000000
4230 Constant constant *same as 64 0x0000000000000000
4232 Generic flat flat 64 0x0000000000000000
4233 Region N/A GDS 32 *not implemented
4235 ================= =========== ======== ======= ==================
4237 The global and constant memory spaces both use global virtual addresses, which
4238 are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some accessible to both.
4242 Using the constant memory space indicates that the data will not change during
4243 the execution of the kernel. This allows scalar read instructions to be
4244 used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values
between kernel dispatches.
4248 The local memory space uses the hardware Local Data Store (LDS) which is
4249 automatically allocated when the hardware creates work-groups of wavefronts, and
4250 freed when all the wavefronts of a work-group have terminated. The data store
4251 (DS) instructions can be used to access it.
4253 The private memory space uses the hardware scratch memory support. If the kernel
4254 uses scratch, then the hardware allocates memory that is accessed using
4255 wavefront lane dword (4 byte) interleaving. The mapping used from private
4256 address to physical address is:
4258 ``wavefront-scratch-base +
4259 (private-address * wavefront-size * 4) +
4260 (wavefront-lane-id * 4)``
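As a sketch, the mapping above could be computed as follows (all names are
illustrative; the wavefront size is 64 or 32 depending on the wavefront mode):

.. code-block:: c++

  #include <cstdint>

  // Physical scratch address of a 4-byte private (scratch) access for a lane.
  uint64_t scratchAddress(uint64_t WavefrontScratchBase,
                          uint64_t PrivateAddress,
                          uint32_t WavefrontSize,
                          uint32_t LaneId) {
    return WavefrontScratchBase +
           PrivateAddress * WavefrontSize * 4 +
           LaneId * 4;
  }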
4262 There are different ways that the wavefront scratch base address is determined
4263 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
4264 memory can be accessed in an interleaved manner using buffer instruction with
4265 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
4266 instructions, or by flat instructions. If each lane of a wavefront accesses the
4267 same private address, the interleaving results in adjacent dwords being accessed
4268 and hence requires fewer cache lines to be fetched. Multi-dword access is not
4269 supported except by flat and scratch instructions in GFX9-GFX11.
4271 The generic address space uses the hardware flat address support available in
GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures) that are outside the range of addressable global memory to
map from a flat address to a private or local address.
4276 FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address falls within one of the
4278 aperture ranges. Flat access to scratch requires hardware aperture setup and
4279 setup in the kernel prologue (see
4280 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
4281 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
4282 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
To convert between a segment address and a flat address, the base address of
the corresponding aperture can be used. For GFX7-GFX8 these are available in the
4286 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
4287 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
4288 GFX9-GFX11 the aperture base addresses are directly available as inline constant
4289 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
4290 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
4291 which makes it easier to convert from flat to segment or segment to flat.
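As an illustrative sketch (names are hypothetical), the 2^32 alignment and
2^32 size of the apertures allow conversions such as:

.. code-block:: c++

  #include <cstdint>

  // Convert a 32-bit group (LDS) segment address to a flat address, given the
  // shared aperture base (e.g. the value of SRC_SHARED_BASE on GFX9+).
  uint64_t groupToFlat(uint64_t SharedApertureBase, uint32_t GroupAddress) {
    return SharedApertureBase | GroupAddress;
  }

  // Convert a flat address back to a group segment address; only valid if the
  // flat address lies within the shared aperture, whose low 32 bits select
  // the byte within the aperture.
  uint32_t flatToGroup(uint64_t FlatAddress) {
    return static_cast<uint32_t>(FlatAddress);
  }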
4296 Image and sample handles created by an HSA compatible runtime (see
4297 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48 byte S#
4298 object respectively. In order to support the HSA ``query_sampler`` operations
4299 two extra dwords are used to store the HSA BRIG enumeration values for the
4300 queries that are not trivially deducible from the S# representation.
4305 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
4306 are 64-bit addresses of a structure allocated in memory accessible from both the
4307 CPU and GPU. The structure is defined by the runtime and subject to change
4308 between releases. For example, see [AMD-ROCm-github]_.
4310 .. _amdgpu-amdhsa-hsa-aql-queue:
4315 The HSA AQL queue structure is defined by an HSA compatible runtime (see
4316 :ref:`amdgpu-os`) and subject to change between releases. For example, see
4317 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
4318 certain language features such as the flat address aperture bases. It also
4319 contains fields used by CP such as managing the allocation of scratch memory.
4321 .. _amdgpu-amdhsa-kernel-descriptor:
4326 A kernel descriptor consists of the information needed by CP to initiate the
4327 execution of a kernel, including the entry point address of the machine code
4328 that implements the kernel.
4330 Code Object V3 Kernel Descriptor
4331 ++++++++++++++++++++++++++++++++
CP microcode requires the Kernel descriptor to be allocated on 64-byte
alignment.
4336 The fields used by CP for code objects before V3 also match those specified in
4337 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4339 .. table:: Code Object V3 Kernel Descriptor
4340 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
4342 ======= ======= =============================== ============================
4343 Bits Size Field Name Description
4344 ======= ======= =============================== ============================
4345 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
4346 address space memory
4347 required for a work-group
4348 in bytes. This does not
4349 include any dynamically
4350 allocated local address
4351 space memory that may be
4352 added when the kernel is
4354 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
4355 private address space
4356 memory required for a
4357 work-item in bytes. When
4358 this cannot be predicted,
4359 code object v4 and older
4360 sets this value to be
higher than the minimum requirement.
4363 95:64 4 bytes KERNARG_SIZE The size of the kernarg
4364 memory pointed to by the
4365 AQL dispatch packet. The
4366 kernarg memory is used to
4367 pass arguments to the
4370 * If the kernarg pointer in
4371 the dispatch packet is NULL
4372 then there are no kernel
4374 * If the kernarg pointer in
4375 the dispatch packet is
4376 not NULL and this value
4377 is 0 then the kernarg
4380 * If the kernarg pointer in
4381 the dispatch packet is
4382 not NULL and this value
4383 is not 0 then the value
4384 specifies the kernarg
4385 memory size in bytes. It
4386 is recommended to provide
4387 a value as it may be used
4388 by CP to optimize making
4390 visible to the kernel
4393 127:96 4 bytes Reserved, must be 0.
191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
negative) from base
address of the kernel
descriptor to kernel's
entry point instruction
which must be 256 byte
aligned.
351:192 20 bytes Reserved, must be 0.
4403 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
4404 Reserved, must be 0.
4407 program settings used by
4409 ``COMPUTE_PGM_RSRC3``
4412 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
4415 program settings used by
4417 ``COMPUTE_PGM_RSRC3``
4420 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
4423 program settings used by
4425 ``COMPUTE_PGM_RSRC3``
4428 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table`.
4429 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
4430 program settings used by
4432 ``COMPUTE_PGM_RSRC1``
4435 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
4436 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
4437 program settings used by
4439 ``COMPUTE_PGM_RSRC2``
4442 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
4443 458:448 7 bits *See separate bits below.* Enable the setup of the
4444 SGPR user data registers
4446 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4448 The total number of SGPR
4450 requested must not exceed
4451 16 and match value in
4452 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
Any requests beyond 16 will be ignored.
4455 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
4457 :ref:`amdgpu-processor-table`
4458 specifies *Architected flat
4459 scratch* then not supported
4461 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
4462 >450 1 bit ENABLE_SGPR_QUEUE_PTR
4463 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
4464 >452 1 bit ENABLE_SGPR_DISPATCH_ID
4465 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
4467 :ref:`amdgpu-processor-table`
4468 specifies *Architected flat
4469 scratch* then not supported
4471 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
4473 457:455 3 bits Reserved, must be 0.
4474 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
4475 Reserved, must be 0.
4478 wavefront size 64 mode.
4480 native wavefront size
4482 459 1 bit USES_DYNAMIC_STACK Indicates if the generated
4483 machine code is using a
4484 dynamically sized stack.
4485 This is only set in code
4486 object v5 and later.
4487 463:460 4 bits Reserved, must be 0.
4488 470:464 7 bits KERNARG_PRELOAD_SPEC_LENGTH GFX6-GFX9
4489 - Reserved, must be 0.
4491 - The number of dwords from
4492 the kernarg segment to preload
4493 into User SGPRs before kernel
4495 :ref:`amdgpu-amdhsa-kernarg-preload`).
4496 479:471 9 bits KERNARG_PRELOAD_SPEC_OFFSET GFX6-GFX9
4497 - Reserved, must be 0.
4499 - An offset in dwords into the
4500 kernarg segment to begin
4501 preloading data into User
4503 :ref:`amdgpu-amdhsa-kernarg-preload`).
4504 511:480 4 bytes Reserved, must be 0.
4505 512 **Total size 64 bytes.**
4506 ======= ====================================================================
4510 .. table:: compute_pgm_rsrc1 for GFX6-GFX12
4511 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table
4513 ======= ======= =============================== ===========================================================================
4514 Bits Size Field Name Description
4515 ======= ======= =============================== ===========================================================================
4516 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
4517 blocks used by each work-item;
4518 granularity is device
4523 - max(0, ceil(vgprs_used / 4) - 1)
4526 - vgprs_used = align(arch_vgprs, 4)
4528 - max(0, ceil(vgprs_used / 8) - 1)
4529 GFX10-GFX11 (wavefront size 64)
4531 - max(0, ceil(vgprs_used / 4) - 1)
4532 GFX10-GFX11 (wavefront size 32)
4534 - max(0, ceil(vgprs_used / 8) - 1)
4536 Where vgprs_used is defined
4537 as the highest VGPR number
4538 explicitly referenced plus
4541 Used by CP to set up
4542 ``COMPUTE_PGM_RSRC1.VGPRS``.
4545 :ref:`amdgpu-assembler`
4547 automatically for the
4548 selected processor from
4549 values provided to the
4550 `.amdhsa_kernel` directive
4552 `.amdhsa_next_free_vgpr`
4553 nested directive (see
4554 :ref:`amdhsa-kernel-directives-table`).
4555 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4556 blocks used by a wavefront;
4557 granularity is device
4562 - max(0, ceil(sgprs_used / 8) - 1)
4565 - 2 * max(0, ceil(sgprs_used / 16) - 1)
4567 Reserved, must be 0.
4572 defined as the highest
4573 SGPR number explicitly
4574 referenced plus one, plus
4575 a target specific number
4576 of additional special
4578 FLAT_SCRATCH (GFX7+) and
4579 XNACK_MASK (GFX8+), and
4582 limitations. It does not
4583 include the 16 SGPRs added
4584 if a trap handler is
4588 limitations and special
4589 SGPR layout are defined in
4591 documentation, which can
4593 :ref:`amdgpu-processors`
4596 Used by CP to set up
4597 ``COMPUTE_PGM_RSRC1.SGPRS``.
4600 :ref:`amdgpu-assembler`
4602 automatically for the
4603 selected processor from
4604 values provided to the
4605 `.amdhsa_kernel` directive
4607 `.amdhsa_next_free_sgpr`
4608 and `.amdhsa_reserve_*`
4609 nested directives (see
4610 :ref:`amdhsa-kernel-directives-table`).
4611 11:10 2 bits PRIORITY Must be 0.
4613 Start executing wavefront
4614 at the specified priority.
4616 CP is responsible for
4618 ``COMPUTE_PGM_RSRC1.PRIORITY``.
4619 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
4620 with specified rounding
4623 precision floating point
4626 Floating point rounding
4627 mode values are defined in
4628 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4630 Used by CP to set up
4631 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4632 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
4633 with specified rounding
4634 denorm mode for half/double (16
4635 and 64-bit) floating point
4636 precision floating point
4639 Floating point rounding
4640 mode values are defined in
4641 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4643 Used by CP to set up
4644 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4645 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
4646 with specified denorm mode
4649 precision floating point
4652 Floating point denorm mode
4653 values are defined in
4654 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4656 Used by CP to set up
4657 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4658 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
4659 with specified denorm mode
4661 and 64-bit) floating point
4662 precision floating point
4665 Floating point denorm mode
4666 values are defined in
4667 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4669 Used by CP to set up
4670 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4671 20 1 bit PRIV Must be 0.
4673 Start executing wavefront
4674 in privilege trap handler
4677 CP is responsible for
4679 ``COMPUTE_PGM_RSRC1.PRIV``.
4680 21 1 bit ENABLE_DX10_CLAMP GFX9-GFX11
4681 Wavefront starts execution
4682 with DX10 clamp mode
4683 enabled. Used by the vector
4684 ALU to force DX10 style
4685 treatment of NaN's (when
4686 set, clamp NaN to zero,
4690 Used by CP to set up
4691 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4693 If 1, wavefronts are scheduled
4694 in a round-robin fashion with
4695 respect to the other wavefronts
4696 of the SIMD. Otherwise, wavefronts
4697 are scheduled in oldest age order.
4699 CP is responsible for filling in
4700 ``COMPUTE_PGM_RSRC1.WG_RR_EN``.
4701 22 1 bit DEBUG_MODE Must be 0.
4703 Start executing wavefront
4704 in single step mode.
4706 CP is responsible for
4708 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4709 23 1 bit ENABLE_IEEE_MODE GFX9-GFX11
4710 Wavefront starts execution
4712 enabled. Floating point
4713 opcodes that support
4714 exception flag gathering
4715 will quiet and propagate
4716 signaling-NaN inputs per
4717 IEEE 754-2008. Min_dx10 and
4718 max_dx10 become IEEE
4719 754-2008 compliant due to
4720 signaling-NaN propagation
4723 Used by CP to set up
4724 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4726 Reserved. Must be 0.
4727 24 1 bit BULKY Must be 0.
4729 Only one work-group allowed
4730 to execute on a compute
4733 CP is responsible for
4735 ``COMPUTE_PGM_RSRC1.BULKY``.
4736 25 1 bit CDBG_USER Must be 0.
4738 Flag that can be used to
4739 control debugging code.
4741 CP is responsible for
4743 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4744 26 1 bit FP16_OVFL GFX6-GFX8
4745 Reserved, must be 0.
4747 Wavefront starts execution
4748 with specified fp16 overflow
4751 - If 0, fp16 overflow generates
4753 - If 1, fp16 overflow that is the
4754 result of an +/-INF input value
4755 or divide by 0 produces a +/-INF,
4756 otherwise clamps computed
4757 overflow to +/-MAX_FP16 as
4760 Used by CP to set up
4761 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4762 28:27 2 bits Reserved, must be 0.
4763 29 1 bit WGP_MODE GFX6-GFX9
4764 Reserved, must be 0.
4766 - If 0 execute work-groups in
4767 CU wavefront execution mode.
4768 - If 1 execute work-groups on
4769 in WGP wavefront execution mode.
4771 See :ref:`amdgpu-amdhsa-memory-model`.
4773 Used by CP to set up
4774 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4775 30 1 bit MEM_ORDERED GFX6-GFX9
4776 Reserved, must be 0.
4778 Controls the behavior of the
4779 s_waitcnt's vmcnt and vscnt
4782 - If 0 vmcnt reports completion
4783 of load and atomic with return
4784 out of order with sample
4785 instructions, and the vscnt
4786 reports the completion of
4787 store and atomic without
4789 - If 1 vmcnt reports completion
4790 of load, atomic with return
4791 and sample instructions in
4792 order, and the vscnt reports
4793 the completion of store and
4794 atomic without return in order.
4796 Used by CP to set up
4797 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4798 31 1 bit FWD_PROGRESS GFX6-GFX9
4799 Reserved, must be 0.
4801 - If 0 execute SIMD wavefronts
4802 using oldest first policy.
4803 - If 1 execute SIMD wavefronts to
4804 ensure wavefronts will make some
4807 Used by CP to set up
4808 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4809 32 **Total size 4 bytes**
4810 ======= ===================================================================================================================
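As a sketch of the encodings above for a GFX9 target (other targets use the
granularities listed in the table), the granulated register counts could be
computed as:

.. code-block:: c++

  #include <algorithm>
  #include <cstdint>

  // GFX6-GFX9: granularity of 4 VGPRs; value is max(0, ceil(vgprs_used/4) - 1).
  uint32_t granulatedVGPRCountGFX9(uint32_t VGPRsUsed) {
    return std::max(0, static_cast<int>((VGPRsUsed + 3) / 4) - 1);
  }

  // GFX9: granularity of 16 SGPRs; value is 2 * max(0, ceil(sgprs_used/16) - 1).
  uint32_t granulatedSGPRCountGFX9(uint32_t SGPRsUsed) {
    return 2 * std::max(0, static_cast<int>((SGPRsUsed + 15) / 16) - 1);
  }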
4814 .. table:: compute_pgm_rsrc2 for GFX6-GFX12
4815 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table
4817 ======= ======= =============================== ===========================================================================
4818 Bits Size Field Name Description
4819 ======= ======= =============================== ===========================================================================
4820 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
4822 * If the *Target Properties*
4824 :ref:`amdgpu-processor-table`
4827 scratch* then enable the
4829 wavefront scratch offset
4830 system register (see
4831 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4832 * If the *Target Properties*
4834 :ref:`amdgpu-processor-table`
4835 specifies *Architected
4836 flat scratch* then enable
4838 FLAT_SCRATCH register
4840 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4842 Used by CP to set up
4843 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4844 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
4846 registers requested. This
4847 number must be greater than
4848 or equal to the number of user
4849 data registers enabled.
4851 Used by CP to set up
4852 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4853 6 1 bit ENABLE_TRAP_HANDLER GFX6-GFX11
4857 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4858 which is set by the CP if
the runtime has installed a trap handler.
4862 Reserved, must be 0.
4863 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
4864 system SGPR register for
4865 the work-group id in the X
4867 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4869 Used by CP to set up
4870 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4871 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
4872 system SGPR register for
4873 the work-group id in the Y
4875 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4877 Used by CP to set up
4878 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4879 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
4880 system SGPR register for
4881 the work-group id in the Z
4883 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4885 Used by CP to set up
4886 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4887 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
4888 system SGPR register for
4889 work-group information (see
4890 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4892 Used by CP to set up
4893 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4894 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
4895 VGPR system registers used
4896 for the work-item ID.
4897 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4900 Used by CP to set up
4901 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4902 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
4904 Wavefront starts execution
4906 exceptions enabled which
4907 are generated when L1 has
4908 witnessed a thread access
4912 CP is responsible for
4913 filling in the address
4915 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4916 according to what the
4918 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
4920 Wavefront starts execution
4921 with memory violation
exceptions
4923 enabled which are generated
4924 when a memory violation has
4925 occurred for this wavefront from
4927 (write-to-read-only-memory,
4928 mis-aligned atomic, LDS
4929 address out of range,
4930 illegal address, etc.).
4934 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4935 according to what the
4937 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
4939 CP uses the rounded value
4940 from the dispatch packet,
4941 not this value, as the
4942 dispatch may contain
4943 dynamically allocated group
4944 segment memory. CP writes
4946 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4948 Amount of group segment
4949 (LDS) to allocate for each
4950 work-group. Granularity is
4954 roundup(lds-size / (64 * 4))
4956 roundup(lds-size / (128 * 4))
4958 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
4959 _INVALID_OPERATION with specified exceptions
4962 Used by CP to set up
4963 ``COMPUTE_PGM_RSRC2.EXCP_EN``
4964 (set from bits 0..6).
4968 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
4969 _SOURCE input operands is a
4971 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
4972 _DIVISION_BY_ZERO Zero
27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
4975 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
4977 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
4979 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
4980 _ZERO (rcp_iflag_f32 instruction
4982 31 1 bit RESERVED Reserved, must be 0.
4983 32 **Total size 4 bytes.**
4984 ======= ===================================================================================================================
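
As a rough illustration of the GRANULATED_LDS_SIZE granularity described in the
table above, the following C++ sketch (the helper name is hypothetical and not
part of any AMD or LLVM API) shows the round-up division that converts a
work-group LDS byte size into allocation granules:

.. code-block:: c++

  #include <cstdint>

  // Encode a work-group LDS byte size as a number of LDS allocation granules.
  // The granule size is target dependent, matching the
  // roundup(lds-size / (64 * 4)) and roundup(lds-size / (128 * 4)) formulas
  // given in the table above (256 or 512 bytes respectively).
  constexpr uint32_t granulatedLdsSize(uint32_t LdsSizeBytes,
                                       uint32_t GranuleBytes) {
    return (LdsSizeBytes + GranuleBytes - 1) / GranuleBytes; // round up
  }

  static_assert(granulatedLdsSize(0, 128 * 4) == 0);
  static_assert(granulatedLdsSize(1, 128 * 4) == 1);    // any LDS use rounds up
  static_assert(granulatedLdsSize(1024, 128 * 4) == 2); // 1 KiB -> 2 granules

Note that the kernel descriptor field itself must be 0; CP computes the
granulated value from the dispatch packet as described in the table.
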
4988 .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4989 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4991 ======= ======= =============================== ===========================================================================
4992 Bits Size Field Name Description
4993 ======= ======= =============================== ===========================================================================
5:0     6 bits  ACCUM_OFFSET                    Offset of the first AccVGPR in the unified register file. Granularity 4.
4995 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4996 63 - accum-offset = 256.
6:15    10 bits                                 Reserved, must be 0.
4999 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
5000 launched in the same CU.
5001 - If 1 the waves of a work-group can be
5002 launched in different CUs. The waves
5003 cannot use S_BARRIER or LDS.
17:31   15 bits                                 Reserved, must be 0.
5006 32 **Total size 4 bytes.**
5007 ======= ===================================================================================================================
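
The ACCUM_OFFSET field above stores the offset of the first AccVGPR divided by
4, minus 1. A minimal sketch of the conversion, using hypothetical helper names
and assuming the offset is a multiple of 4 in the range 4..256:

.. code-block:: c++

  #include <cstdint>

  // ACCUM_OFFSET field value <-> offset of the first AccVGPR in the unified
  // register file, per the table above (granularity 4, field value 0..63).
  constexpr uint32_t encodeAccumOffset(uint32_t AccumOffset) {
    return AccumOffset / 4 - 1;
  }
  constexpr uint32_t decodeAccumOffset(uint32_t FieldValue) {
    return (FieldValue + 1) * 4;
  }

  static_assert(encodeAccumOffset(4) == 0);
  static_assert(encodeAccumOffset(256) == 63);
  static_assert(decodeAccumOffset(63) == 256);
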
5011 .. table:: compute_pgm_rsrc3 for GFX10-GFX11
5012 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
5014 ======= ======= =============================== ===========================================================================
5015 Bits Size Field Name Description
5016 ======= ======= =============================== ===========================================================================
5017 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For
5018 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
5019 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
5020 not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
5021 9:4 6 bits INST_PREF_SIZE GFX10
5022 Reserved, must be 0.
5024 Number of instruction bytes to prefetch, starting at the kernel's entry
5025 point instruction, before wavefront starts execution. The value is 0..63
5026 with a granularity of 128 bytes.
5027 10 1 bit TRAP_ON_START GFX10
5028 Reserved, must be 0.
5032 If 1, wavefront starts execution by trapping into the trap handler.
5034 CP is responsible for filling in the trap on start bit in
5035 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
5037 11 1 bit TRAP_ON_END GFX10
5038 Reserved, must be 0.
5042 If 1, wavefront execution terminates by trapping into the trap handler.
5044 CP is responsible for filling in the trap on end bit in
5045 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
5046 30:12 19 bits Reserved, must be 0.
5047 31 1 bit IMAGE_OP GFX10
5048 Reserved, must be 0.
5050 If 1, the kernel execution contains image instructions. If executed as
5051 part of a graphics pipeline, image read instructions will stall waiting
5052 for any necessary ``WAIT_SYNC`` fence to be performed in order to
5053 indicate that earlier pipeline stages have completed writing to the
5056 Not used for compute kernels that are not part of a graphics pipeline and
5058 32 **Total size 4 bytes.**
5059 ======= ===================================================================================================================
5063 .. table:: compute_pgm_rsrc3 for GFX12
5064 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx12-table
5066 ======= ======= =============================== ===========================================================================
5067 Bits Size Field Name Description
5068 ======= ======= =============================== ===========================================================================
5069 3:0 4 bits RESERVED Reserved, must be 0.
5070 11:4 8 bits INST_PREF_SIZE Number of instruction bytes to prefetch, starting at the kernel's entry
5071 point instruction, before wavefront starts execution. The value is 0..255
5072 with a granularity of 128 bytes.
5073 12 1 bit RESERVED Reserved, must be 0.
5074 13 1 bit GLG_EN If 1, group launch guarantee will be enabled for this dispatch
5075 30:14 17 bits RESERVED Reserved, must be 0.
5076 31 1 bit IMAGE_OP If 1, the kernel execution contains image instructions. If executed as
5077 part of a graphics pipeline, image read instructions will stall waiting
5078 for any necessary ``WAIT_SYNC`` fence to be performed in order to
5079 indicate that earlier pipeline stages have completed writing to the
5082 Not used for compute kernels that are not part of a graphics pipeline and
5084 32 **Total size 4 bytes.**
5085 ======= ===================================================================================================================
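
Both the GFX10-GFX11 and GFX12 INST_PREF_SIZE fields express the prefetch
amount in 128-byte granules; only the field width differs (0..63 versus
0..255). An illustrative sketch of the encoding, using a hypothetical helper:

.. code-block:: c++

  #include <algorithm>
  #include <cstdint>

  // Convert a requested prefetch byte count into an INST_PREF_SIZE value,
  // rounding up to whole 128-byte granules and saturating at the maximum
  // encodable value for the field (63 for GFX10-GFX11, 255 for GFX12).
  constexpr uint32_t encodeInstPrefSize(uint32_t PrefetchBytes,
                                        uint32_t MaxEncodedValue) {
    uint32_t Granules = (PrefetchBytes + 127) / 128;
    return std::min(Granules, MaxEncodedValue);
  }

  static_assert(encodeInstPrefSize(256, 255) == 2);
  static_assert(encodeInstPrefSize(1 << 20, 63) == 63); // saturates
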
5089 .. table:: Floating Point Rounding Mode Enumeration Values
5090 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
5092 ====================================== ===== ==============================
5093 Enumeration Name Value Description
5094 ====================================== ===== ==============================
5095 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
5096 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
5097 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
5098 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
5099 ====================================== ===== ==============================
5102 .. table:: Extended FLT_ROUNDS Enumeration Values
5103 :name: amdgpu-rounding-mode-enumeration-values-table
5105 +------------------------+---------------+-------------------+--------------------+----------+
5106 | | F32 NEAR_EVEN | F32 PLUS_INFINITY | F32 MINUS_INFINITY | F32 ZERO |
5107 +------------------------+---------------+-------------------+--------------------+----------+
5108 | F64/F16 NEAR_EVEN | 1 | 11 | 14 | 17 |
5109 +------------------------+---------------+-------------------+--------------------+----------+
5110 | F64/F16 PLUS_INFINITY | 8 | 2 | 15 | 18 |
5111 +------------------------+---------------+-------------------+--------------------+----------+
5112 | F64/F16 MINUS_INFINITY | 9 | 12 | 3 | 19 |
5113 +------------------------+---------------+-------------------+--------------------+----------+
5114 | F64/F16 ZERO | 10 | 13 | 16 | 0 |
5115 +------------------------+---------------+-------------------+--------------------+----------+
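
The extended FLT_ROUNDS value is determined by the pair of F32 and F64/F16
rounding modes. The following sketch mirrors the table above as a lookup,
indexed by the FLOAT_ROUND_MODE enumeration values (0 = NEAR_EVEN,
1 = PLUS_INFINITY, 2 = MINUS_INFINITY, 3 = ZERO); the helper name is
hypothetical:

.. code-block:: c++

  // Extended FLT_ROUNDS values, indexed as [f64/f16 mode][f32 mode].
  constexpr int ExtendedFltRounds[4][4] = {
      // F32:   NE  +INF  -INF  ZERO
      /* NE   */ { 1,  11,   14,   17},
      /* +INF */ { 8,   2,   15,   18},
      /* -INF */ { 9,  12,    3,   19},
      /* ZERO */ {10,  13,   16,    0},
  };

  constexpr int extendedFltRounds(unsigned F64F16Mode, unsigned F32Mode) {
    return ExtendedFltRounds[F64F16Mode][F32Mode];
  }

  static_assert(extendedFltRounds(0, 0) == 1); // both round-to-nearest-even
  static_assert(extendedFltRounds(3, 3) == 0); // both round-toward-zero
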
5119 .. table:: Floating Point Denorm Mode Enumeration Values
5120 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
5122 ====================================== ===== ====================================
5123 Enumeration Name Value Description
5124 ====================================== ===== ====================================
5125 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination Denorms
5126 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
5127 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
5128 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
5129 ====================================== ===== ====================================
Denormal flushing is sign respecting, i.e. the behavior expected by
``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
``"denormal-fp-math"="positive-zero"``.
5137 .. table:: System VGPR Work-Item ID Enumeration Values
5138 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
5140 ======================================== ===== ============================
5141 Enumeration Name Value Description
5142 ======================================== ===== ============================
5143 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
5145 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
5147 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
5149 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
5150 ======================================== ===== ============================
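
The ENABLE_VGPR_WORKITEM_ID field of ``compute_pgm_rsrc2`` holds one of the
enumeration values above; informally, the value plus one is the number of
work-item ID VGPRs the hardware sets up. A small sketch with a hypothetical
helper:

.. code-block:: c++

  // Number of work-item ID VGPRs initialized for a given
  // SYSTEM_VGPR_WORKITEM_ID value: X only, X and Y, or X, Y and Z.
  // Value 3 is undefined, so 0 is returned here as a sentinel.
  constexpr unsigned numWorkItemIdVgprs(unsigned SystemVgprWorkItemId) {
    return SystemVgprWorkItemId <= 2 ? SystemVgprWorkItemId + 1 : 0;
  }

  static_assert(numWorkItemIdVgprs(0) == 1); // X
  static_assert(numWorkItemIdVgprs(2) == 3); // X, Y and Z
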
5152 .. _amdgpu-amdhsa-initial-kernel-execution-state:
5154 Initial Kernel Execution State
5155 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5157 This section defines the register state that will be set up by the packet
5158 processor prior to the start of execution of every wavefront. This is limited by
5159 the constraints of the hardware controllers of CP/ADC/SPI.
5161 The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
5163 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5164 for enabled registers are dense starting at SGPR0: the first enabled register is
5165 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
5168 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
5169 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
5170 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
5171 actually initialized. These are then immediately followed by the System SGPRs
5172 that are set up by ADC/SPI and can have different values for each wavefront of
5175 SGPR register initial state is defined in
5176 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
5178 .. table:: SGPR Register Set Up Order
5179 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
5181 ========== ========================== ====== ==============================
5182 SGPR Order Name Number Description
5183 (kernel descriptor enable of
5185 ========== ========================== ====== ==============================
5186 First Private Segment Buffer 4 See
5187 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5189 then Dispatch Ptr 2 64-bit address of AQL dispatch
5190 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
5192 then Queue Ptr 2 64-bit address of amd_queue_t
5193 (enable_sgpr_queue_ptr) object for AQL queue on which
5194 the dispatch packet was
5196 then Kernarg Segment Ptr 2 64-bit address of Kernarg
5197 (enable_sgpr_kernarg segment. This is directly
5198 _segment_ptr) copied from the
5199 kernarg_address in the kernel
5202 Having CP load it once avoids
5203 loading it at the beginning of
5205 then Dispatch Id 2 64-bit Dispatch ID of the
5206 (enable_sgpr_dispatch_id) dispatch packet being
5208 then Flat Scratch Init 2 See
5209 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5211 then Preloaded Kernargs N/A See
5212 (kernarg_preload_spec :ref:`amdgpu-amdhsa-kernarg-preload`.
5214 then Private Segment Size 1 The 32-bit byte size of a
5215 (enable_sgpr_private single work-item's memory
5216 _segment_size) allocation. This is the
5217 value from the kernel
5218 dispatch packet Private
5219 Segment Byte Size rounded up
5220 by CP to a multiple of
5223 Having CP load it once avoids
5224 loading it at the beginning of
5227 This is not used for
5228 GFX7-GFX8 since it is the same
5229 value as the second SGPR of
5230 Flat Scratch Init. However, it
5231 may be needed for GFX9-GFX11 which
5232 changes the meaning of the
5233 Flat Scratch Init value.
5234 then Work-Group Id X 1 32-bit work-group id in X
5235 (enable_sgpr_workgroup_id dimension of grid for
5237 then Work-Group Id Y 1 32-bit work-group id in Y
5238 (enable_sgpr_workgroup_id dimension of grid for
5240 then Work-Group Id Z 1 32-bit work-group id in Z
5241 (enable_sgpr_workgroup_id dimension of grid for
5243 then Work-Group Info 1 {first_wavefront, 14'b0000,
5244 (enable_sgpr_workgroup ordered_append_term[10:0],
5245 _info) threadgroup_size_in_wavefronts[5:0]}
5246 then Scratch Wavefront Offset 1 See
5247 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5248 _segment_wavefront_offset) and
5249 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5250 ========== ========================== ====== ==============================
5252 The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
5254 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5255 for enabled registers are dense starting at VGPR0: the first enabled register is
5256 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
5259 There are different methods used for the VGPR initial state:
5261 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
5262 specifies otherwise, a separate VGPR register is used per work-item ID. The
5263 VGPR register initial state for this method is defined in
5264 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
5265 * If *Target Properties* column of :ref:`amdgpu-processor-table`
5266 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
5267 for all work-item IDs. The register layout for this method is defined in
5268 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
5270 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
5271 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
5273 ========== ========================== ====== ==============================
5274 VGPR Order Name Number Description
5275 (kernel descriptor enable of
5277 ========== ========================== ====== ==============================
5278 First Work-Item Id X 1 32-bit work-item id in X
5279 (Always initialized) dimension of work-group for
5281 then Work-Item Id Y 1 32-bit work-item id in Y
5282 (enable_vgpr_workitem_id dimension of work-group for
5283 > 0) wavefront lane.
5284 then Work-Item Id Z 1 32-bit work-item id in Z
5285 (enable_vgpr_workitem_id dimension of work-group for
5286 > 1) wavefront lane.
5287 ========== ========================== ====== ==============================
5291 .. table:: Register Layout for Packed Work-Item ID Method
5292 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
5294 ======= ======= ================ =========================================
5295 Bits Size Field Name Description
5296 ======= ======= ================ =========================================
5297 0:9 10 bits Work-Item Id X Work-item id in X
5298 dimension of work-group for
5303 10:19 10 bits Work-Item Id Y Work-item id in Y
5304 dimension of work-group for
5307 Initialized if enable_vgpr_workitem_id >
5308 0, otherwise set to 0.
5309 20:29 10 bits Work-Item Id Z Work-item id in Z
5310 dimension of work-group for
5313 Initialized if enable_vgpr_workitem_id >
5314 1, otherwise set to 0.
5315 30:31 2 bits Reserved, set to 0.
5316 ======= ======= ================ =========================================
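
For targets using the packed method, all three work-item IDs arrive in VGPR0,
10 bits each, in the layout shown above. A minimal decode sketch (the struct
and helper are hypothetical, for illustration only):

.. code-block:: c++

  #include <cstdint>

  struct WorkItemId {
    uint32_t X, Y, Z;
  };

  // Unpack the packed work-item ID layout: X in the low 10 bits, then Y,
  // then Z. Y and Z are only meaningful when enable_vgpr_workitem_id is
  // large enough; otherwise those bit fields are 0.
  constexpr WorkItemId unpackWorkItemId(uint32_t Vgpr0) {
    return {Vgpr0 & 0x3ff, (Vgpr0 >> 10) & 0x3ff, (Vgpr0 >> 20) & 0x3ff};
  }

  static_assert(unpackWorkItemId(0x00C0A005).X == 5);
  static_assert(unpackWorkItemId(0x00C0A005).Y == 40);
  static_assert(unpackWorkItemId(0x00C0A005).Z == 12);
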
5318 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
5320 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
5322 2. Work-group Id registers X, Y, Z are set by ADC which supports any
5323 combination including none.
5324 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
5325 its value cannot be included with the flat scratch init value which is per
5326 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
5327 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
5329 5. Flat Scratch register pair initialization is described in
5330 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5332 The global segment can be accessed either using buffer instructions (GFX6 which
5333 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
5334 instructions (GFX9-GFX11).
5336 If buffer operations are used, then the compiler can generate a V# with the
5337 following properties:
5341 * ATC: 1 if IOMMU present (such as APU)
5343 * MTYPE set to support memory coherence that matches the runtime (such as CC for
5344 APU and NC for dGPU).
5346 .. _amdgpu-amdhsa-kernarg-preload:
5348 Preloaded Kernel Arguments
5349 ++++++++++++++++++++++++++
5351 On hardware that supports this feature, kernel arguments can be preloaded into
5352 User SGPRs, up to the maximum number of User SGPRs available. The allocation of
5353 Preload SGPRs occurs directly after the last enabled non-kernarg preload User
5354 SGPR. (See :ref:`amdgpu-amdhsa-initial-kernel-execution-state`)
The data preloaded is copied from the kernarg segment; the amount of data is
determined by the value specified in the kernarg_preload_spec_length field of
5358 the kernel descriptor. This data is then loaded into consecutive User SGPRs. The
5359 number of SGPRs receiving preloaded kernarg data corresponds with the value
5360 given by kernarg_preload_spec_length. The preloading starts at the dword offset
5361 within the kernarg segment, which is specified by the
5362 kernarg_preload_spec_offset field.
5364 If the kernarg_preload_spec_length is non-zero, the CP firmware will append an
5365 additional 256 bytes to the kernel_code_entry_byte_offset. This addition
5366 facilitates the incorporation of a prologue to the kernel entry to handle cases
5367 where code designed for kernarg preloading is executed on hardware equipped with
incompatible firmware. If the hardware has compatible firmware, the 256 bytes at the
5369 start of the kernel entry will be skipped.
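
A minimal sketch of the bookkeeping described above, using hypothetical helper
names (the 256-byte adjustment and the dword-based offset are the only facts
taken from the text):

.. code-block:: c++

  #include <cstdint>

  // Entry point actually used when the firmware supports kernarg preloading:
  // 256 bytes are added so the backwards-compatibility prologue emitted at
  // the original entry point is skipped.
  constexpr uint64_t
  effectiveEntryByteOffset(uint64_t KernelCodeEntryByteOffset,
                           uint16_t KernargPreloadSpecLength) {
    return KernelCodeEntryByteOffset +
           (KernargPreloadSpecLength != 0 ? 256 : 0);
  }

  // Byte offset within the kernarg segment at which preloading starts; the
  // kernel descriptor field is expressed in dwords.
  constexpr uint64_t preloadByteOffset(uint16_t KernargPreloadSpecOffset) {
    return uint64_t(KernargPreloadSpecOffset) * 4;
  }
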
5371 .. _amdgpu-amdhsa-kernel-prolog:
5376 The compiler performs initialization in the kernel prologue depending on the
5377 target and information about things like stack usage in the kernel and called
5378 functions. Some of this initialization requires the compiler to request certain
5379 User and System SGPRs be present in the
5380 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
5381 :ref:`amdgpu-amdhsa-kernel-descriptor`.
5383 .. _amdgpu-amdhsa-kernel-prolog-cfi:
5388 1. The CFI return address is undefined.
5390 2. The CFI CFA is defined using an expression which evaluates to a location
5391 description that comprises one memory location description for the
5392 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
5394 .. _amdgpu-amdhsa-kernel-prolog-m0:
For GFX6-GFX8, the M0 register must be initialized with a value at least the
total LDS size if the kernel may access LDS via DS or flat operations. The
total LDS size is available in the dispatch packet. For M0, it is also possible
to use the maximum possible LDS value for the given target (0x7FFF for GFX6 and
0xFFFF for GFX7-GFX8).

For GFX9 onwards, the M0 register is not used for range checking LDS accesses
and so does not need to be initialized in the prolog.
5409 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
If the kernel has function calls, it must set up the ABI stack pointer described
5415 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
5416 SGPR32 to the unswizzled scratch offset of the address past the last local
5419 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
5424 If the kernel needs a frame pointer for the reasons defined in
5425 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
5426 kernel prolog. If a frame pointer is not required then all uses of the frame
5427 pointer are replaced with immediate ``0`` offsets.
5429 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
5434 There are different methods used for initializing flat scratch:
5436 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5437 specifies *Does not support generic address space*:
5439 Flat scratch is not supported and there is no flat scratch register pair.
5441 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5442 specifies *Offset flat scratch*:
5444 If the kernel or any function it calls may use flat operations to access
5445 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5446 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
5447 Scratch Wavefront Offset SGPR registers (see
5448 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5450 1. The low word of Flat Scratch Init is the 32-bit byte offset from
5451 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
5452 being managed by SPI for the queue executing the kernel dispatch. This is
5453 the same value used in the Scratch Segment Buffer V# base address.
5455 CP obtains this from the runtime. (The Scratch Segment Buffer base address
5456 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
5458 The prolog must add the value of Scratch Wavefront Offset to get the
5459 wavefront's byte scratch backing memory offset from
5460 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
5462 The Scratch Wavefront Offset must also be used as an offset with Private
5463 segment address when using the Scratch Segment Buffer.
5465 Since FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right
5466 shifted by 8 before moving into FLAT_SCRATCH_HI.
5468 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
5469 SGPRn is the highest numbered SGPR allocated to the wavefront).
5470 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
5471 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
5472 FLAT SCRATCH BASE in flat memory instructions that access the scratch
2. The second word of Flat Scratch Init is the 32-bit byte size of a single
   work-item's scratch memory usage (a sketch of this arithmetic follows the
   list below).
5477 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
5478 checks that the value in the kernel dispatch packet Private Segment Byte
5479 Size is not larger and requests the runtime to increase the queue's scratch
5482 CP directly loads from the kernel dispatch packet Private Segment Byte Size
5483 field and rounds up to a multiple of DWORD. Having CP load it once avoids
5484 loading it at the beginning of every wavefront.
5486 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
5487 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
5488 in flat memory instructions.
5490 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5491 specifies *Absolute flat scratch*:
5493 If the kernel or any function it calls may use flat operations to access
5494 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5495 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
5496 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
5497 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5499 The Flat Scratch Init is the 64-bit address of the base of scratch backing
5500 memory being managed by SPI for the queue executing the kernel dispatch.
5502 CP obtains this from the runtime.
5504 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
5505 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
5506 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
5507 memory instructions.
5509 The Scratch Wavefront Offset must also be used as an offset with Private
5510 segment address when using the Scratch Segment Buffer (see
5511 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
5513 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5514 specifies *Architected flat scratch*:
5516 If ENABLE_PRIVATE_SEGMENT is enabled in
5517 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table` then the FLAT_SCRATCH
5518 register pair will be initialized to the 64-bit address of the base of scratch
5519 backing memory being managed by SPI for the queue executing the kernel
5520 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
5521 flat scratch base in flat memory instructions.
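
For the *Offset flat scratch* case, the per-wavefront set up can be summarized
as plain integer arithmetic. The sketch below only illustrates that arithmetic
(names are hypothetical); the real prolog performs it with SGPR moves and adds
on the registers described above:

.. code-block:: c++

  #include <cstdint>

  struct FlatScratchSetup {
    uint32_t WaveOffset256B;   // per-wavefront scratch offset, 256-byte units
    uint32_t SizePerWorkItem;  // FLAT SCRATCH SIZE, in bytes
  };

  constexpr FlatScratchSetup
  computeOffsetFlatScratch(uint32_t FlatScratchInitLo, // byte offset from
                                                       // SH_HIDDEN_PRIVATE_BASE_VIMID
                           uint32_t FlatScratchInitHi, // per-work-item byte size
                           uint32_t ScratchWavefrontOffset) {
    // Add this wave's scratch offset, then convert to 256-byte units.
    uint32_t WaveByteOffset = FlatScratchInitLo + ScratchWavefrontOffset;
    return {WaveByteOffset >> 8, FlatScratchInitHi};
  }

  static_assert(computeOffsetFlatScratch(0x10000, 4096, 0x200).WaveOffset256B
                == 0x102);
  static_assert(computeOffsetFlatScratch(0x10000, 4096, 0x200).SizePerWorkItem
                == 4096);
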
5523 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
5525 Private Segment Buffer
5526 ++++++++++++++++++++++
5528 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
5529 *Architected flat scratch* then a Private Segment Buffer is not supported.
5530 Instead the flat SCRATCH instructions are used.
Otherwise, the Private Segment Buffer SGPR register is used to initialize 4 SGPRs
5533 that are used as a V# to access scratch. CP uses the value provided by the
5534 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
5535 access the private memory space using a segment address. See
5536 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
5538 The scratch V# is a four-aligned SGPR and always selected for the kernel as
5541 - If it is known during instruction selection that there is stack usage,
5542 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
5543 optimizations are disabled (``-O0``), if stack objects already exist (for
5544 locals, etc.), or if there are any function calls.
5546 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5547 are reserved for the tentative scratch V#. These will be used if it is
5548 determined that spilling is needed.
5550 - If no use is made of the tentative scratch V#, then it is unreserved,
5551 and the register count is determined ignoring it.
5552 - If use is made of the tentative scratch V#, then its register numbers
5553 are shifted to the first four-aligned SGPR index after the highest one
5554 allocated by the register allocator, and all uses are updated. The
5555 register count includes them in the shifted location.
5556 - In either case, if the processor has the SGPR allocation bug, the
5557 tentative allocation is not shifted or unreserved in order to ensure
the register count is higher to work around the bug.
5562 This approach of using a tentative scratch V# and shifting the register
5563 numbers if used avoids having to perform register allocation a second
5564 time if the tentative V# is eliminated. This is more efficient and
5565 avoids the problem that the second register allocation may perform
5566 spilling which will fail as there is no longer a scratch V#.
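
Where the text above says the tentative scratch V# is "shifted to the first
four-aligned SGPR index after the highest one allocated", that is simply an
align-up computation; a hypothetical sketch:

.. code-block:: c++

  #include <cstdint>

  // First SGPR of the shifted scratch V#: the next four-aligned SGPR index
  // after the highest SGPR chosen by the register allocator.
  constexpr uint32_t shiftedScratchRsrcFirstSgpr(uint32_t HighestAllocatedSgpr) {
    return (HighestAllocatedSgpr + 1 + 3) & ~3u; // align up to a multiple of 4
  }

  static_assert(shiftedScratchRsrcFirstSgpr(10) == 12); // V# uses SGPR12..15
  static_assert(shiftedScratchRsrcFirstSgpr(11) == 12);
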
5568 When the kernel prolog code is being emitted it is known whether the scratch V#
5569 described above is actually used. If it is, the prolog code must set it up by
5570 copying the Private Segment Buffer to the scratch V# registers and then adding
5571 the Private Segment Wavefront Offset to the queue base address in the V#. The
5572 result is a V# with a base address pointing to the beginning of the wavefront
5573 scratch backing memory.
5575 The Private Segment Buffer is always requested, but the Private Segment
5576 Wavefront Offset is only requested if it is used (see
5577 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5579 .. _amdgpu-amdhsa-memory-model:
5584 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5585 code (see :ref:`memmodel`).
5587 The AMDGPU backend supports the memory synchronization scopes specified in
5588 :ref:`amdgpu-memory-scopes`.
5590 The code sequences used to implement the memory model specify the order of
5591 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5592 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5593 to other memory instructions executed by the same thread. This allows them to be
5594 moved earlier or later which can allow them to be combined with other instances
5595 of the same instruction, or hoisted/sunk out of loops to improve performance.
5596 Only the instructions related to the memory model are given; additional
5597 ``s_waitcnt`` instructions are required to ensure registers are defined before
5598 being used. These may be able to be combined with the memory model ``s_waitcnt``
5599 instructions as described above.
5601 The AMDGPU backend supports the following memory models:
5603 HSA Memory Model [HSA]_
5604 The HSA memory model uses a single happens-before relation for all address
5605 spaces (see :ref:`amdgpu-address-spaces`).
5606 OpenCL Memory Model [OpenCL]_
5607 The OpenCL memory model which has separate happens-before relations for the
5608 global and local address spaces. Only a fence specifying both global and
5609 local address space, and seq_cst instructions join the relationships. Since
5610 the LLVM ``memfence`` instruction does not allow an address space to be
specified, the OpenCL fence has to conservatively assume both the local and
global address spaces were specified. However, optimizations can often be
5613 done to eliminate the additional ``s_waitcnt`` instructions when there are
5614 no intervening memory instructions which access the corresponding address
5615 space. The code sequences in the table indicate what can be omitted for the
OpenCL memory model. The target triple environment is used to determine if the
5617 source language is OpenCL (see :ref:`amdgpu-opencl`).
5619 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
5622 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5623 termed vector memory operations.
5625 Private address space uses ``buffer_load/store`` using the scratch V#
5626 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5627 is accessing the memory, atomic memory orderings are not meaningful, and all
5628 accesses are treated as non-atomic.
5630 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5631 scalar memory instructions). Since the constant address space contents do not
5632 change during the execution of a kernel dispatch it is not legal to perform
5633 stores, and atomic memory orderings are not meaningful, and all accesses are
5634 treated as non-atomic.
5636 A memory synchronization scope wider than work-group is not meaningful for the
5637 group (LDS) address space and is treated as work-group.
5639 The memory model does not support the region address space which is treated as
5642 Acquire memory ordering is not meaningful on store atomic instructions and is
5643 treated as non-atomic.
5645 Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
5648 Acquire-release memory ordering is not meaningful on load or store atomic
5649 instructions and is treated as acquire and release respectively.
5651 The memory order also adds the single thread optimization constraints defined in
5653 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5655 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5656 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5658 ============ ==============================================================
5659 LLVM Memory Optimization Constraints
5661 ============ ==============================================================
5664 acquire - If a load atomic/atomicrmw then no following load/load
5665 atomic/store/store atomic/atomicrmw/fence instruction can be
5666 moved before the acquire.
5667 - If a fence then same as load atomic, plus no preceding
5668 associated fence-paired-atomic can be moved after the fence.
5669 release - If a store atomic/atomicrmw then no preceding load/load
5670 atomic/store/store atomic/atomicrmw/fence instruction can be
5671 moved after the release.
5672 - If a fence then same as store atomic, plus no following
5673 associated fence-paired-atomic can be moved before the
5675 acq_rel Same constraints as both acquire and release.
5676 seq_cst - If a load atomic then same constraints as acquire, plus no
5677 preceding sequentially consistent load atomic/store
5678 atomic/atomicrmw/fence instruction can be moved after the
5680 - If a store atomic then the same constraints as release, plus
5681 no following sequentially consistent load atomic/store
5682 atomic/atomicrmw/fence instruction can be moved before the
5684 - If an atomicrmw/fence then same constraints as acq_rel.
5685 ============ ==============================================================
5687 The code sequences used to implement the memory model are defined in the
5690 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5691 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5692 * :ref:`amdgpu-amdhsa-memory-model-gfx942`
5693 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5695 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5697 Memory Model GFX6-GFX9
5698 ++++++++++++++++++++++
5702 * Each agent has multiple shader arrays (SA).
5703 * Each SA has multiple compute units (CU).
5704 * Each CU has multiple SIMDs that execute wavefronts.
5705 * The wavefronts for a single work-group are executed in the same CU but may be
5706 executed by different SIMDs.
5707 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
5709 * All LDS operations of a CU are performed as wavefront wide operations in a
5710 global order and involve no caching. Completion is reported to a wavefront in
5712 * The LDS memory has multiple request queues shared by the SIMDs of a
5713 CU. Therefore, the LDS operations performed by different wavefronts of a
5714 work-group can be reordered relative to each other, which can result in
5715 reordering the visibility of vector memory operations with respect to LDS
5716 operations of other wavefronts in the same work-group. A ``s_waitcnt
5717 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5718 vector memory operations between wavefronts of a work-group, but not between
5719 operations performed by the same wavefront.
5720 * The vector memory operations are performed as wavefront wide operations and
5721 completion is reported to a wavefront in execution order. The exception is
5722 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5723 vector memory order if they access LDS memory, and out of LDS operation order
5724 if they access global memory.
5725 * The vector memory operations access a single vector L1 cache shared by all
SIMDs of a CU. Therefore, no special action is required for coherence between the
5727 lanes of a single wavefront, or for coherence between wavefronts in the same
5728 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
5729 wavefronts executing in different work-groups as they may be executing on
5731 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5732 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5733 scalar operations are used in a restricted way so do not impact the memory
5734 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5735 * The vector and scalar memory operations use an L2 cache shared by all CUs on
5737 * The L2 cache has independent channels to service disjoint ranges of virtual
5739 * Each CU has a separate request queue per channel. Therefore, the vector and
5740 scalar memory operations performed by wavefronts executing in different
5741 work-groups (which may be executing on different CUs) of an agent can be
5742 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5743 ensure synchronization between vector memory operations of different CUs. It
5744 ensures a previous vector memory operation has completed before executing a
5745 subsequent vector memory or LDS operation and so can be used to meet the
5746 requirements of acquire and release.
5747 * The L2 cache can be kept coherent with other agents on some targets, or ranges
5748 of virtual addresses can be set up to bypass it to ensure system coherence.
5750 Scalar memory operations are only used to access memory that is proven to not
5751 change during the execution of the kernel dispatch. This includes constant
5752 address space and global address space for program scope ``const`` variables.
5753 Therefore, the kernel machine code does not have to maintain the scalar cache to
5754 ensure it is coherent with the vector caches. The scalar and vector caches are
5755 invalidated between kernel dispatches by CP since constant address space data
5756 may change between kernel dispatch executions. See
5757 :ref:`amdgpu-amdhsa-memory-spaces`.
5759 The one exception is if scalar writes are used to spill SGPR registers. In this
5760 case the AMDGPU backend ensures the memory location used to spill is never
5761 accessed by vector memory operations at the same time. If scalar writes are used
5762 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5763 return since the locations may be used for vector memory instructions by a
5764 future wavefront that uses the same scratch area, or a function call that
5765 creates a frame at the same address, respectively. There is no need for a
5766 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5768 For kernarg backing memory:
5770 * CP invalidates the L1 cache at the start of each kernel dispatch.
5771 * On dGPU the kernarg backing memory is allocated in host memory accessed as
5772 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5773 causes it to be treated as non-volatile and so is not invalidated by
On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
5776 and so the L2 cache will be coherent with the CPU and other agents.
5778 Scratch backing memory (which is used for the private address space) is accessed
5779 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5780 only accessed by a single thread, and is always write-before-read, there is
5781 never a need to invalidate these entries from the L1 cache. Hence all cache
5782 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5784 The code sequences used to implement the memory model for GFX6-GFX9 are defined
5785 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5787 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5788 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5790 ============ ============ ============== ========== ================================
5791 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
5792 Ordering Sync Scope Address GFX6-GFX9
5794 ============ ============ ============== ========== ================================
5796 ------------------------------------------------------------------------------------
5797 load *none* *none* - global - !volatile & !nontemporal
5799 - private 1. buffer/global/flat_load
5801 - !volatile & nontemporal
5803 1. buffer/global/flat_load
5808 1. buffer/global/flat_load
5810 2. s_waitcnt vmcnt(0)
5812 - Must happen before
5813 any following volatile
5824 load *none* *none* - local 1. ds_load
5825 store *none* *none* - global - !volatile & !nontemporal
5827 - private 1. buffer/global/flat_store
5829 - !volatile & nontemporal
5831 1. buffer/global/flat_store
5836 1. buffer/global/flat_store
5837 2. s_waitcnt vmcnt(0)
5839 - Must happen before
5840 any following volatile
5851 store *none* *none* - local 1. ds_store
5852 **Unordered Atomic**
5853 ------------------------------------------------------------------------------------
5854 load atomic unordered *any* *any* *Same as non-atomic*.
5855 store atomic unordered *any* *any* *Same as non-atomic*.
5856 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
5857 **Monotonic Atomic**
5858 ------------------------------------------------------------------------------------
5859 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
5861 - workgroup - generic
5862 load atomic monotonic - agent - global 1. buffer/global/flat_load
5863 - system - generic glc=1
5864 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
5865 - wavefront - generic
5869 store atomic monotonic - singlethread - local 1. ds_store
5872 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
5873 - wavefront - generic
5877 atomicrmw monotonic - singlethread - local 1. ds_atomic
5881 ------------------------------------------------------------------------------------
5882 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
5885 load atomic acquire - workgroup - global 1. buffer/global_load
5886 load atomic acquire - workgroup - local 1. ds/flat_load
5887 - generic 2. s_waitcnt lgkmcnt(0)
5890 - Must happen before
5899 older than a local load
5903 load atomic acquire - agent - global 1. buffer/global_load
5905 2. s_waitcnt vmcnt(0)
5907 - Must happen before
5915 3. buffer_wbinvl1_vol
5917 - Must happen before
5927 load atomic acquire - agent - generic 1. flat_load glc=1
5928 - system 2. s_waitcnt vmcnt(0) &
5933 - Must happen before
5936 - Ensures the flat_load
5941 3. buffer_wbinvl1_vol
5943 - Must happen before
5953 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
5956 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
5957 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
5958 - generic 2. s_waitcnt lgkmcnt(0)
5961 - Must happen before
5974 atomicrmw acquire - agent - global 1. buffer/global_atomic
5975 - system 2. s_waitcnt vmcnt(0)
5977 - Must happen before
5986 3. buffer_wbinvl1_vol
5988 - Must happen before
5998 atomicrmw acquire - agent - generic 1. flat_atomic
5999 - system 2. s_waitcnt vmcnt(0) &
6004 - Must happen before
6013 3. buffer_wbinvl1_vol
6015 - Must happen before
6025 fence acquire - singlethread *none* *none*
6027 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6032 - However, since LLVM
6057 fence-paired-atomic).
6058 - Must happen before
6069 fence-paired-atomic.
6071 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
6078 - However, since LLVM
6086 - Could be split into
6095 - s_waitcnt vmcnt(0)
6106 fence-paired-atomic).
6107 - s_waitcnt lgkmcnt(0)
6118 fence-paired-atomic).
6119 - Must happen before
6133 fence-paired-atomic.
6135 2. buffer_wbinvl1_vol
6137 - Must happen before any
6138 following global/generic
6148 ------------------------------------------------------------------------------------
6149 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
6152 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6161 - Must happen before
6172 2. buffer/global/flat_store
6173 store atomic release - workgroup - local 1. ds_store
6174 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
6175 - system - generic vmcnt(0)
6181 - Could be split into
6190 - s_waitcnt vmcnt(0)
6197 - s_waitcnt lgkmcnt(0)
6204 - Must happen before
6215 2. buffer/global/flat_store
6216 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
6219 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6228 - Must happen before
6239 2. buffer/global/flat_atomic
6240 atomicrmw release - workgroup - local 1. ds_atomic
6241 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
6242 - system - generic vmcnt(0)
6246 - Could be split into
6255 - s_waitcnt vmcnt(0)
6262 - s_waitcnt lgkmcnt(0)
6269 - Must happen before
6280 2. buffer/global/flat_atomic
6281 fence release - singlethread *none* *none*
6283 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6288 - However, since LLVM
6309 - Must happen before
6318 fence-paired-atomic).
6325 fence-paired-atomic.
6327 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
6338 - However, since LLVM
6353 - Could be split into
6362 - s_waitcnt vmcnt(0)
6369 - s_waitcnt lgkmcnt(0)
6376 - Must happen before
6385 fence-paired-atomic).
6392 fence-paired-atomic.
6394 **Acquire-Release Atomic**
6395 ------------------------------------------------------------------------------------
6396 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
6399 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
6408 - Must happen before
6419 2. buffer/global_atomic
6421 atomicrmw acq_rel - workgroup - local 1. ds_atomic
6422 2. s_waitcnt lgkmcnt(0)
6425 - Must happen before
6434 older than the local load
6438 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
6447 - Must happen before
6459 3. s_waitcnt lgkmcnt(0)
6462 - Must happen before
6471 older than a local load
6475 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
6480 - Could be split into
6489 - s_waitcnt vmcnt(0)
6496 - s_waitcnt lgkmcnt(0)
6503 - Must happen before
6514 2. buffer/global_atomic
6515 3. s_waitcnt vmcnt(0)
6517 - Must happen before
6526 4. buffer_wbinvl1_vol
6528 - Must happen before
6538 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
6543 - Could be split into
6552 - s_waitcnt vmcnt(0)
6559 - s_waitcnt lgkmcnt(0)
6566 - Must happen before
6578 3. s_waitcnt vmcnt(0) &
6583 - Must happen before
6592 4. buffer_wbinvl1_vol
6594 - Must happen before
6604 fence acq_rel - singlethread *none* *none*
6606 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6626 - Must happen before
6649 acquire-fence-paired-atomic)
6670 release-fence-paired-atomic).
6675 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
6682 - However, since LLVM
6690 - Could be split into
6699 - s_waitcnt vmcnt(0)
6706 - s_waitcnt lgkmcnt(0)
6713 - Must happen before
6718 global/local/generic
6727 acquire-fence-paired-atomic)
6739 global/local/generic
6748 release-fence-paired-atomic).
6753 2. buffer_wbinvl1_vol
6755 - Must happen before
6769 **Sequential Consistent Atomic**
6770 ------------------------------------------------------------------------------------
6771 load atomic seq_cst - singlethread - global *Same as corresponding
6772 - wavefront - local load atomic acquire,
6773 - generic except must generate
6774 all instructions even
6776 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
6792 lgkmcnt(0) and so do
6824 order. The s_waitcnt
6825 could be placed after
6829 make the s_waitcnt be
6836 instructions same as
6839 except must generate
6840 all instructions even
6842 load atomic seq_cst - workgroup - local *Same as corresponding
6843 load atomic acquire,
6844 except must generate
6845 all instructions even
6848 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
6849 - system - generic vmcnt(0)
6851 - Could be split into
6860 - s_waitcnt lgkmcnt(0)
6873 lgkmcnt(0) and so do
6876 - s_waitcnt vmcnt(0)
6921 order. The s_waitcnt
6922 could be placed after
6926 make the s_waitcnt be
6933 instructions same as
6936 except must generate
6937 all instructions even
6939 store atomic seq_cst - singlethread - global *Same as corresponding
6940 - wavefront - local store atomic release,
6941 - workgroup - generic except must generate
6942 - agent all instructions even
6943 - system for OpenCL.*
6944 atomicrmw seq_cst - singlethread - global *Same as corresponding
6945 - wavefront - local atomicrmw acq_rel,
6946 - workgroup - generic except must generate
6947 - agent all instructions even
6948 - system for OpenCL.*
6949 fence seq_cst - singlethread *none* *Same as corresponding
6950 - wavefront fence acq_rel,
6951 - workgroup except must generate
6952 - agent all instructions even
6953 - system for OpenCL.*
6954 ============ ============ ============== ========== ================================
6956 .. _amdgpu-amdhsa-memory-model-gfx90a:
6963 * Each agent has multiple shader arrays (SA).
6964 * Each SA has multiple compute units (CU).
6965 * Each CU has multiple SIMDs that execute wavefronts.
6966 * The wavefronts for a single work-group are executed in the same CU but may be
executed by different SIMDs. The exception is tgsplit execution mode, in which
the wavefronts may be executed by different SIMDs in different CUs.
6969 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
executing on it. The exception is tgsplit execution mode, in which no LDS is
allocated because wavefronts of the same work-group can be in different CUs.
6972 * All LDS operations of a CU are performed as wavefront wide operations in a
6973 global order and involve no caching. Completion is reported to a wavefront in
6975 * The LDS memory has multiple request queues shared by the SIMDs of a
6976 CU. Therefore, the LDS operations performed by different wavefronts of a
6977 work-group can be reordered relative to each other, which can result in
6978 reordering the visibility of vector memory operations with respect to LDS
6979 operations of other wavefronts in the same work-group. A ``s_waitcnt
6980 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6981 vector memory operations between wavefronts of a work-group, but not between
6982 operations performed by the same wavefront.
6983 * The vector memory operations are performed as wavefront wide operations and
6984 completion is reported to a wavefront in execution order. The exception is
6985 that ``flat_load/store/atomic`` instructions can report out of vector memory
6986 order if they access LDS memory, and out of LDS operation order if they access
6988 * The vector memory operations access a single vector L1 cache shared by all
SIMDs of a CU. Therefore:
6991 * No special action is required for coherence between the lanes of a single
6994 * No special action is required for coherence between wavefronts in the same
6995 work-group since they execute on the same CU. The exception is when in
6996 tgsplit execution mode as wavefronts of the same work-group can be in
6997 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
7000 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
7001 executing in different work-groups as they may be executing on different
7004 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
7005 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
7006 scalar operations are used in a restricted way so do not impact the memory
7007 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
7008 * The vector and scalar memory operations use an L2 cache shared by all CUs on
7011 * The L2 cache has independent channels to service disjoint ranges of virtual
7013 * Each CU has a separate request queue per channel. Therefore, the vector and
7014 scalar memory operations performed by wavefronts executing in different
7015 work-groups (which may be executing on different CUs), or the same
7016 work-group if executing in tgsplit mode, of an agent can be reordered
7017 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
7018 synchronization between vector memory operations of different CUs. It
7019 ensures a previous vector memory operation has completed before executing a
7020 subsequent vector memory or LDS operation and so can be used to meet the
7021 requirements of acquire and release.
7022 * The L2 cache of one agent can be kept coherent with other agents by:
7023 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
7024 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
7025 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
7027 * Any local memory cache lines will be automatically invalidated by writes
7028 from CUs associated with other L2 caches, or writes from the CPU, due to
7029 the cache probe caused by coherent requests. Coherent requests are caused
7030 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
7031 XGMI, and by PCIe requests that are configured to be coherent requests.
7032 * XGMI accesses from the CPU to local memory may be cached on the CPU.
7033 Subsequent access from the GPU will automatically invalidate or writeback
the CPU cache due to the L2 probe filter and the PTE C-bit being set.
7035 * Since all work-groups on the same agent share the same L2, no L2
7036 invalidation or writeback is required for coherence.
7037 * To ensure coherence of local and remote memory writes of work-groups in
7038 different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
7039 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
(used for remote coarse grain memory). Note that MTYPE CC (used for local
7041 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
7042 remote fine grain memory) bypasses the L2, so both will never result in
7043 dirty L2 cache lines.
7044 * To ensure coherence of local and remote memory reads of work-groups in
7045 different agents a ``buffer_invl2`` is required. It will invalidate L2
7046 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
7047 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
7048 coarse memory) cause local reads to be invalidated by remote writes with
the PTE C-bit so these cache lines are not invalidated. Note that
7050 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
7051 never result in L2 cache lines that need to be invalidated.
7053 * PCIe access from the GPU to the CPU memory is kept coherent by using the
7054 MTYPE UC (uncached) which bypasses the L2.
7056 Scalar memory operations are only used to access memory that is proven to not
7057 change during the execution of the kernel dispatch. This includes constant
7058 address space and global address space for program scope ``const`` variables.
7059 Therefore, the kernel machine code does not have to maintain the scalar cache to
7060 ensure it is coherent with the vector caches. The scalar and vector caches are
7061 invalidated between kernel dispatches by CP since constant address space data
7062 may change between kernel dispatch executions. See
7063 :ref:`amdgpu-amdhsa-memory-spaces`.
7065 The one exception is if scalar writes are used to spill SGPR registers. In this
7066 case the AMDGPU backend ensures the memory location used to spill is never
7067 accessed by vector memory operations at the same time. If scalar writes are used
7068 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
7069 return since the locations may be used for vector memory instructions by a
7070 future wavefront that uses the same scratch area, or a function call that
7071 creates a frame at the same address, respectively. There is no need for a
7072 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
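A minimal sketch of the end of a kernel that used scalar stores to spill SGPRs
(the spill base in ``s[0:1]``, the offset, and the spilled register are
illustrative assumptions)::

  s_store_dword s10, s[0:1], 0x10       // earlier SGPR spill via a scalar write
  // ... rest of the kernel ...
  s_dcache_wb                           // write back dirty scalar cache lines so a
                                        // later wavefront or callee sees the spills
  s_endpgm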
7074 For kernarg backing memory:
7076 * CP invalidates the L1 cache at the start of each kernel dispatch.
7077 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
7078 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
7079 cache. This also causes it to be treated as non-volatile and so is not
7080 invalidated by ``*_vol``.
7081 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
7082 so the L2 cache will be coherent with the CPU and other agents.
7084 Scratch backing memory (which is used for the private address space) is accessed
7085 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
7086 only accessed by a single thread, and is always write-before-read, there is
7087 never a need to invalidate these entries from the L1 cache. Hence all cache
7088 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
7090 The code sequences used to implement the memory model for GFX90A are defined
7091 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
7093 .. table:: AMDHSA Memory Model Code Sequences GFX90A
7094 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
7096 ============ ============ ============== ========== ================================
7097 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
7098 Ordering Sync Scope Address GFX90A
7100 ============ ============ ============== ========== ================================
7101 **Non-Atomic**
7102 ------------------------------------------------------------------------------------
7103 load *none* *none* - global - !volatile & !nontemporal
7105 - private 1. buffer/global/flat_load
7107 - !volatile & nontemporal
7109 1. buffer/global/flat_load
7114 1. buffer/global/flat_load
7116 2. s_waitcnt vmcnt(0)
7118 - Must happen before
7119 any following volatile
7130 load *none* *none* - local 1. ds_load
7131 store *none* *none* - global - !volatile & !nontemporal
7133 - private 1. buffer/global/flat_store
7135 - !volatile & nontemporal
7137 1. buffer/global/flat_store
7142 1. buffer/global/flat_store
7143 2. s_waitcnt vmcnt(0)
7145 - Must happen before
7146 any following volatile
7157 store *none* *none* - local 1. ds_store
7158 **Unordered Atomic**
7159 ------------------------------------------------------------------------------------
7160 load atomic unordered *any* *any* *Same as non-atomic*.
7161 store atomic unordered *any* *any* *Same as non-atomic*.
7162 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
7163 **Monotonic Atomic**
7164 ------------------------------------------------------------------------------------
7165 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
7166 - wavefront - generic
7167 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
7170 - If not TgSplit execution
7173 load atomic monotonic - singlethread - local *If TgSplit execution mode,
7174 - wavefront local address space cannot
7175 - workgroup be used.*
7178 load atomic monotonic - agent - global 1. buffer/global/flat_load
7180 load atomic monotonic - system - global 1. buffer/global/flat_load
7182 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
7183 - wavefront - generic
7186 store atomic monotonic - system - global 1. buffer/global/flat_store
7188 store atomic monotonic - singlethread - local *If TgSplit execution mode,
7189 - wavefront local address space cannot
7190 - workgroup be used.*
7193 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
7194 - wavefront - generic
7197 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
7199 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
7200 - wavefront local address space cannot
7201 - workgroup be used.*
7204 **Acquire Atomic**
7205 ------------------------------------------------------------------------------------
7206 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
7209 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
7211 - If not TgSplit execution
7214 2. s_waitcnt vmcnt(0)
7216 - If not TgSplit execution
7218 - Must happen before the
7219 following buffer_wbinvl1_vol.
7221 3. buffer_wbinvl1_vol
7223 - If not TgSplit execution
7225 - Must happen before
7236 load atomic acquire - workgroup - local *If TgSplit execution mode,
7237 local address space cannot
7241 2. s_waitcnt lgkmcnt(0)
7244 - Must happen before
7253 older than the local load
7257 load atomic acquire - workgroup - generic 1. flat_load glc=1
7259 - If not TgSplit execution
7262 2. s_waitcnt lgkm/vmcnt(0)
7264 - Use lgkmcnt(0) if not
7265 TgSplit execution mode
7266 and vmcnt(0) if TgSplit
7268 - If OpenCL, omit lgkmcnt(0).
7269 - Must happen before
7271 buffer_wbinvl1_vol and any
7272 following global/generic
7279 older than a local load
7283 3. buffer_wbinvl1_vol
7285 - If not TgSplit execution
7292 load atomic acquire - agent - global 1. buffer/global_load
7294 2. s_waitcnt vmcnt(0)
7296 - Must happen before
7304 3. buffer_wbinvl1_vol
7306 - Must happen before
7316 load atomic acquire - system - global 1. buffer/global/flat_load
7318 2. s_waitcnt vmcnt(0)
7320 - Must happen before
7321 following buffer_invl2 and
7331 - Must happen before
7339 stale L1 global data,
7340 nor see stale L2 MTYPE
7342 MTYPE RW and CC memory will
7343 never be stale in L2 due to
7346 load atomic acquire - agent - generic 1. flat_load glc=1
7347 2. s_waitcnt vmcnt(0) &
7350 - If TgSplit execution mode,
7354 - Must happen before
7357 - Ensures the flat_load
7362 3. buffer_wbinvl1_vol
7364 - Must happen before
7374 load atomic acquire - system - generic 1. flat_load glc=1
7375 2. s_waitcnt vmcnt(0) &
7378 - If TgSplit execution mode,
7382 - Must happen before
7386 - Ensures the flat_load
7394 - Must happen before
7402 stale L1 global data,
7403 nor see stale L2 MTYPE
7405 MTYPE RW and CC memory will
7406 never be stale in L2 due to
7409 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
7410 - wavefront - generic
7411 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
7412 - wavefront local address space cannot
7416 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
7417 2. s_waitcnt vmcnt(0)
7419 - If not TgSplit execution
7421 - Must happen before the
7422 following buffer_wbinvl1_vol.
7423 - Ensures the atomicrmw
7428 3. buffer_wbinvl1_vol
7430 - If not TgSplit execution
7432 - Must happen before
7442 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
7443 local address space cannot
7447 2. s_waitcnt lgkmcnt(0)
7450 - Must happen before
7459 older than the local
7463 atomicrmw acquire - workgroup - generic 1. flat_atomic
7464 2. s_waitcnt lgkm/vmcnt(0)
7466 - Use lgkmcnt(0) if not
7467 TgSplit execution mode
7468 and vmcnt(0) if TgSplit
7470 - If OpenCL, omit lgkmcnt(0).
7471 - Must happen before
7473 buffer_wbinvl1_vol and
7486 3. buffer_wbinvl1_vol
7488 - If not TgSplit execution
7495 atomicrmw acquire - agent - global 1. buffer/global_atomic
7496 2. s_waitcnt vmcnt(0)
7498 - Must happen before
7507 3. buffer_wbinvl1_vol
7509 - Must happen before
7519 atomicrmw acquire - system - global 1. buffer/global_atomic
7520 2. s_waitcnt vmcnt(0)
7522 - Must happen before
7523 following buffer_invl2 and
7534 - Must happen before
7542 stale L1 global data,
7543 nor see stale L2 MTYPE
7545 MTYPE RW and CC memory will
7546 never be stale in L2 due to
7549 atomicrmw acquire - agent - generic 1. flat_atomic
7550 2. s_waitcnt vmcnt(0) &
7553 - If TgSplit execution mode,
7557 - Must happen before
7566 3. buffer_wbinvl1_vol
7568 - Must happen before
7578 atomicrmw acquire - system - generic 1. flat_atomic
7579 2. s_waitcnt vmcnt(0) &
7582 - If TgSplit execution mode,
7586 - Must happen before
7599 - Must happen before
7607 stale L1 global data,
7608 nor see stale L2 MTYPE
7610 MTYPE RW and CC memory will
7611 never be stale in L2 due to
7614 fence acquire - singlethread *none* *none*
7616 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7618 - Use lgkmcnt(0) if not
7619 TgSplit execution mode
7620 and vmcnt(0) if TgSplit
7630 - However, since LLVM
7645 - s_waitcnt vmcnt(0)
7657 fence-paired-atomic).
7658 - s_waitcnt lgkmcnt(0)
7669 fence-paired-atomic).
7670 - Must happen before
7672 buffer_wbinvl1_vol and
7683 fence-paired-atomic.
7685 2. buffer_wbinvl1_vol
7687 - If not TgSplit execution
7694 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
7697 - If TgSplit execution mode,
7703 - However, since LLVM
7711 - Could be split into
7720 - s_waitcnt vmcnt(0)
7731 fence-paired-atomic).
7732 - s_waitcnt lgkmcnt(0)
7743 fence-paired-atomic).
7744 - Must happen before
7758 fence-paired-atomic.
7760 2. buffer_wbinvl1_vol
7762 - Must happen before any
7763 following global/generic
7772 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
7775 - If TgSplit execution mode,
7781 - However, since LLVM
7789 - Could be split into
7798 - s_waitcnt vmcnt(0)
7809 fence-paired-atomic).
7810 - s_waitcnt lgkmcnt(0)
7821 fence-paired-atomic).
7822 - Must happen before
7823 the following buffer_invl2 and
7836 fence-paired-atomic.
7841 - Must happen before any
7842 following global/generic
7849 stale L1 global data,
7850 nor see stale L2 MTYPE
7852 MTYPE RW and CC memory will
7853 never be stale in L2 due to
7855 **Release Atomic**
7856 ------------------------------------------------------------------------------------
7857 store atomic release - singlethread - global 1. buffer/global/flat_store
7858 - wavefront - generic
7859 store atomic release - singlethread - local *If TgSplit execution mode,
7860 - wavefront local address space cannot
7864 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7866 - Use lgkmcnt(0) if not
7867 TgSplit execution mode
7868 and vmcnt(0) if TgSplit
7870 - If OpenCL, omit lgkmcnt(0).
7871 - s_waitcnt vmcnt(0)
7874 global/generic load/store/
7875 load atomic/store atomic/
7877 - s_waitcnt lgkmcnt(0)
7884 - Must happen before
7895 2. buffer/global/flat_store
7896 store atomic release - workgroup - local *If TgSplit execution mode,
7897 local address space cannot
7901 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
7904 - If TgSplit execution mode,
7910 - Could be split into
7919 - s_waitcnt vmcnt(0)
7926 - s_waitcnt lgkmcnt(0)
7933 - Must happen before
7944 2. buffer/global/flat_store
7945 store atomic release - system - global 1. buffer_wbl2
7947 - Must happen before
7948 following s_waitcnt.
7949 - Performs L2 writeback to
7953 visible at system scope.
7955 2. s_waitcnt lgkmcnt(0) &
7958 - If TgSplit execution mode,
7964 - Could be split into
7973 - s_waitcnt vmcnt(0)
7974 must happen after any
7980 - s_waitcnt lgkmcnt(0)
7981 must happen after any
7987 - Must happen before
7992 to memory and the L2
7999 3. buffer/global/flat_store
8000 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
8001 - wavefront - generic
8002 atomicrmw release - singlethread - local *If TgSplit execution mode,
8003 - wavefront local address space cannot
8007 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8009 - Use lgkmcnt(0) if not
8010 TgSplit execution mode
8011 and vmcnt(0) if TgSplit
8015 - s_waitcnt vmcnt(0)
8018 global/generic load/store/
8019 load atomic/store atomic/
8021 - s_waitcnt lgkmcnt(0)
8028 - Must happen before
8039 2. buffer/global/flat_atomic
8040 atomicrmw release - workgroup - local *If TgSplit execution mode,
8041 local address space cannot
8045 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
8048 - If TgSplit execution mode,
8052 - Could be split into
8061 - s_waitcnt vmcnt(0)
8068 - s_waitcnt lgkmcnt(0)
8075 - Must happen before
8086 2. buffer/global/flat_atomic
8087 atomicrmw release - system - global 1. buffer_wbl2
8089 - Must happen before
8090 following s_waitcnt.
8091 - Performs L2 writeback to
8095 visible at system scope.
8097 2. s_waitcnt lgkmcnt(0) &
8100 - If TgSplit execution mode,
8104 - Could be split into
8113 - s_waitcnt vmcnt(0)
8120 - s_waitcnt lgkmcnt(0)
8127 - Must happen before
8132 to memory and the L2
8139 3. buffer/global/flat_atomic
8140 fence release - singlethread *none* *none*
8142 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8144 - Use lgkmcnt(0) if not
8145 TgSplit execution mode
8146 and vmcnt(0) if TgSplit
8156 - However, since LLVM
8171 - s_waitcnt vmcnt(0)
8176 load atomic/store atomic/
8178 - s_waitcnt lgkmcnt(0)
8185 - Must happen before
8194 fence-paired-atomic).
8201 fence-paired-atomic.
8203 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
8206 - If TgSplit execution mode,
8216 - However, since LLVM
8231 - Could be split into
8240 - s_waitcnt vmcnt(0)
8247 - s_waitcnt lgkmcnt(0)
8254 - Must happen before
8263 fence-paired-atomic).
8270 fence-paired-atomic.
8272 fence release - system *none* 1. buffer_wbl2
8277 - Must happen before
8278 following s_waitcnt.
8279 - Performs L2 writeback to
8283 visible at system scope.
8285 2. s_waitcnt lgkmcnt(0) &
8288 - If TgSplit execution mode,
8298 - However, since LLVM
8313 - Could be split into
8322 - s_waitcnt vmcnt(0)
8329 - s_waitcnt lgkmcnt(0)
8336 - Must happen before
8345 fence-paired-atomic).
8352 fence-paired-atomic.
8354 **Acquire-Release Atomic**
8355 ------------------------------------------------------------------------------------
8356 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
8357 - wavefront - generic
8358 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
8359 - wavefront local address space cannot
8363 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8365 - Use lgkmcnt(0) if not
8366 TgSplit execution mode
8367 and vmcnt(0) if TgSplit
8377 - s_waitcnt vmcnt(0)
8380 global/generic load/store/
8381 load atomic/store atomic/
8383 - s_waitcnt lgkmcnt(0)
8390 - Must happen before
8401 2. buffer/global_atomic
8402 3. s_waitcnt vmcnt(0)
8404 - If not TgSplit execution
8406 - Must happen before
8416 4. buffer_wbinvl1_vol
8418 - If not TgSplit execution
8425 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
8426 local address space cannot
8430 2. s_waitcnt lgkmcnt(0)
8433 - Must happen before
8442 older than the local load
8446 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
8448 - Use lgkmcnt(0) if not
8449 TgSplit execution mode
8450 and vmcnt(0) if TgSplit
8454 - s_waitcnt vmcnt(0)
8457 global/generic load/store/
8458 load atomic/store atomic/
8460 - s_waitcnt lgkmcnt(0)
8467 - Must happen before
8479 3. s_waitcnt lgkmcnt(0) &
8482 - If not TgSplit execution
8483 mode, omit vmcnt(0).
8486 - Must happen before
8488 buffer_wbinvl1_vol and
8497 older than a local load
8501 3. buffer_wbinvl1_vol
8503 - If not TgSplit execution
8510 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
8513 - If TgSplit execution mode,
8517 - Could be split into
8526 - s_waitcnt vmcnt(0)
8533 - s_waitcnt lgkmcnt(0)
8540 - Must happen before
8551 2. buffer/global_atomic
8552 3. s_waitcnt vmcnt(0)
8554 - Must happen before
8563 4. buffer_wbinvl1_vol
8565 - Must happen before
8575 atomicrmw acq_rel - system - global 1. buffer_wbl2
8577 - Must happen before
8578 following s_waitcnt.
8579 - Performs L2 writeback to
8583 visible at system scope.
8585 2. s_waitcnt lgkmcnt(0) &
8588 - If TgSplit execution mode,
8592 - Could be split into
8601 - s_waitcnt vmcnt(0)
8608 - s_waitcnt lgkmcnt(0)
8615 - Must happen before
8620 to global and L2 writeback
8621 have completed before
8626 3. buffer/global_atomic
8627 4. s_waitcnt vmcnt(0)
8629 - Must happen before
8630 following buffer_invl2 and
8641 - Must happen before
8649 stale L1 global data,
8650 nor see stale L2 MTYPE
8652 MTYPE RW and CC memory will
8653 never be stale in L2 due to
8656 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
8659 - If TgSplit execution mode,
8663 - Could be split into
8672 - s_waitcnt vmcnt(0)
8679 - s_waitcnt lgkmcnt(0)
8686 - Must happen before
8698 3. s_waitcnt vmcnt(0) &
8701 - If TgSplit execution mode,
8705 - Must happen before
8714 4. buffer_wbinvl1_vol
8716 - Must happen before
8726 atomicrmw acq_rel - system - generic 1. buffer_wbl2
8728 - Must happen before
8729 following s_waitcnt.
8730 - Performs L2 writeback to
8734 visible at system scope.
8736 2. s_waitcnt lgkmcnt(0) &
8739 - If TgSplit execution mode,
8743 - Could be split into
8752 - s_waitcnt vmcnt(0)
8759 - s_waitcnt lgkmcnt(0)
8766 - Must happen before
8771 to global and L2 writeback
8772 have completed before
8778 4. s_waitcnt vmcnt(0) &
8781 - If TgSplit execution mode,
8785 - Must happen before
8786 following buffer_invl2 and
8797 - Must happen before
8805 stale L1 global data,
8806 nor see stale L2 MTYPE
8808 MTYPE RW and CC memory will
8809 never be stale in L2 due to
8812 fence acq_rel - singlethread *none* *none*
8814 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8816 - Use lgkmcnt(0) if not
8817 TgSplit execution mode
8818 and vmcnt(0) if TgSplit
8837 - s_waitcnt vmcnt(0)
8842 load atomic/store atomic/
8844 - s_waitcnt lgkmcnt(0)
8851 - Must happen before
8874 acquire-fence-paired-atomic)
8895 release-fence-paired-atomic).
8899 - Must happen before
8903 acquire-fence-paired
8904 atomic has completed
8913 acquire-fence-paired-atomic.
8915 2. buffer_wbinvl1_vol
8917 - If not TgSplit execution
8924 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
8927 - If TgSplit execution mode,
8933 - However, since LLVM
8941 - Could be split into
8950 - s_waitcnt vmcnt(0)
8957 - s_waitcnt lgkmcnt(0)
8964 - Must happen before
8969 global/local/generic
8978 acquire-fence-paired-atomic)
8990 global/local/generic
8999 release-fence-paired-atomic).
9004 2. buffer_wbinvl1_vol
9006 - Must happen before
9020 fence acq_rel - system *none* 1. buffer_wbl2
9025 - Must happen before
9026 following s_waitcnt.
9027 - Performs L2 writeback to
9031 visible at system scope.
9033 2. s_waitcnt lgkmcnt(0) &
9036 - If TgSplit execution mode,
9042 - However, since LLVM
9050 - Could be split into
9059 - s_waitcnt vmcnt(0)
9066 - s_waitcnt lgkmcnt(0)
9073 - Must happen before
9074 the following buffer_invl2 and
9078 global/local/generic
9087 acquire-fence-paired-atomic)
9099 global/local/generic
9108 release-fence-paired-atomic).
9116 - Must happen before
9125 stale L1 global data,
9126 nor see stale L2 MTYPE
9128 MTYPE RW and CC memory will
9129 never be stale in L2 due to
9132 **Sequential Consistent Atomic**
9133 ------------------------------------------------------------------------------------
9134 load atomic seq_cst - singlethread - global *Same as corresponding
9135 - wavefront - local load atomic acquire,
9136 - generic except must generate
9137 all instructions even
9139 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
9141 - Use lgkmcnt(0) if not
9142 TgSplit execution mode
9143 and vmcnt(0) if TgSplit
9145 - s_waitcnt lgkmcnt(0) must
9158 lgkmcnt(0) and so do
9161 - s_waitcnt vmcnt(0)
9180 consistent global/local
9206 order. The s_waitcnt
9207 could be placed after
9211 make the s_waitcnt be
9218 instructions same as
9221 except must generate
9222 all instructions even
9224 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
9225 local address space cannot
9228 *Same as corresponding
9229 load atomic acquire,
9230 except must generate
9231 all instructions even
9234 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
9235 - system - generic vmcnt(0)
9237 - If TgSplit execution mode,
9239 - Could be split into
9248 - s_waitcnt lgkmcnt(0)
9261 lgkmcnt(0) and so do
9264 - s_waitcnt vmcnt(0)
9309 order. The s_waitcnt
9310 could be placed after
9314 make the s_waitcnt be
9321 instructions same as
9324 except must generate
9325 all instructions even
9327 store atomic seq_cst - singlethread - global *Same as corresponding
9328 - wavefront - local store atomic release,
9329 - workgroup - generic except must generate
9330 - agent all instructions even
9331 - system for OpenCL.*
9332 atomicrmw seq_cst - singlethread - global *Same as corresponding
9333 - wavefront - local atomicrmw acq_rel,
9334 - workgroup - generic except must generate
9335 - agent all instructions even
9336 - system for OpenCL.*
9337 fence seq_cst - singlethread *none* *Same as corresponding
9338 - wavefront fence acq_rel,
9339 - workgroup except must generate
9340 - agent all instructions even
9341 - system for OpenCL.*
9342 ============ ============ ============== ========== ================================
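For example, reading the table above, an ``acquire`` atomic load at ``agent``
scope from the global address space corresponds to a sequence of roughly this
shape (a sketch only; register choices and addressing are placeholders)::

  global_load_dword v2, v[0:1], off glc // 1. load with glc=1 so the L1 is bypassed
  s_waitcnt vmcnt(0)                    // 2. the load has completed before invalidating
  buffer_wbinvl1_vol                    // 3. invalidate volatile vector L1 lines so
                                        //    following accesses do not see stale data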
9344 .. _amdgpu-amdhsa-memory-model-gfx942:
9346 Memory Model GFX942
9347 +++++++++++++++++++
9349 For GFX942:
9351 * Each agent has multiple shader arrays (SA).
9352 * Each SA has multiple compute units (CU).
9353 * Each CU has multiple SIMDs that execute wavefronts.
9354 * The wavefronts for a single work-group are executed in the same CU but may be
9355 executed by different SIMDs. The exception is tgsplit execution mode, in which
9356 the wavefronts may be executed by different SIMDs in different CUs.
9357 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
9358 executing on it. The exception is tgsplit execution mode, in which no LDS is
9359 allocated since wavefronts of the same work-group can be in different CUs.
9360 * All LDS operations of a CU are performed as wavefront wide operations in a
9361 global order and involve no caching. Completion is reported to a wavefront in
9362 execution order.
9363 * The LDS memory has multiple request queues shared by the SIMDs of a
9364 CU. Therefore, the LDS operations performed by different wavefronts of a
9365 work-group can be reordered relative to each other, which can result in
9366 reordering the visibility of vector memory operations with respect to LDS
9367 operations of other wavefronts in the same work-group. A ``s_waitcnt
9368 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
9369 vector memory operations between wavefronts of a work-group, but not between
9370 operations performed by the same wavefront.
9371 * The vector memory operations are performed as wavefront wide operations and
9372 completion is reported to a wavefront in execution order. The exception is
9373 that ``flat_load/store/atomic`` instructions can report out of vector memory
9374 order if they access LDS memory, and out of LDS operation order if they access
9375 global memory.
9376 * The vector memory operations access a single vector L1 cache shared by all
9377 SIMDs of a CU. Therefore:
9379 * No special action is required for coherence between the lanes of a single
9380 wavefront.
9382 * No special action is required for coherence between wavefronts in the same
9383 work-group since they execute on the same CU. The exception is tgsplit
9384 execution mode, as wavefronts of the same work-group can be in different
9385 CUs and so a ``buffer_inv sc0`` is required which will invalidate
9386 the L1 cache.
9388 * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
9389 between wavefronts executing in different work-groups as they may be
9390 executing on different CUs.
9392 * Atomic read-modify-write instructions implicitly bypass the L1 cache.
9393 Therefore, they do not use the sc0 bit for coherence and instead use it to
9394 indicate if the instruction returns the original value being updated. They
9395 do use sc1 to indicate system or agent scope coherence.
9397 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
9398 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
9399 scalar operations are used in a restricted way so do not impact the memory
9400 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
9401 * The vector and scalar memory operations use an L2 cache.
9403 * The gfx942 can be configured as a number of smaller agents with each having
9404 a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
9405 larger agents with groups of CUs on each agent sharing separate L2
9406 caches.
9407 * The L2 cache has independent channels to service disjoint ranges of virtual
9408 addresses.
9409 * Each CU has a separate request queue per channel for its associated L2.
9410 Therefore, the vector and scalar memory operations performed by wavefronts
9411 executing with different L1 caches and the same L2 cache can be reordered
9412 relative to each other.
9413 * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
9414 vector memory operations of different CUs. It ensures a previous vector
9415 memory operation has completed before executing a subsequent vector memory
9416 or LDS operation and so can be used to meet the requirements of acquire and
9417 release.
9418 * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
9419 (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
9420 the PTE C-bit set for memory not local to the L2.
9422 * Any local memory cache lines will be automatically invalidated by writes
9423 from CUs associated with other L2 caches, or writes from the CPU, due to
9424 the cache probe caused by the PTE C-bit.
9425 * XGMI accesses from the CPU to local memory may be cached on the CPU.
9426 Subsequent access from the GPU will automatically invalidate or writeback
9427 the CPU cache due to the L2 probe filter.
9428 * To ensure coherence of local memory writes of CUs with different L1 caches
9429 in the same agent a ``buffer_wbl2`` is required. It does nothing if the
9430 agent is configured to have a single L2, or will writeback dirty L2 cache
9431 lines if configured to have multiple L2 caches.
9432 * To ensure coherence of local memory writes of CUs in different agents a
9433 ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
9434 * To ensure coherence of local memory reads of CUs with different L1 caches
9435 in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
9436 agent is configured to have a single L2, or will invalidate non-local L2
9437 cache lines if configured to have multiple L2 caches.
9438 * To ensure coherence of local memory reads of CUs in different agents a
9439 ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
9440 lines if configured to have multiple L2 caches.
9442 * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
9443 UC (uncached) which bypasses the L2.
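As a hand-written illustration of these operations (not compiler output; the
registers, addressing, and the exact cache-policy bits required for each case
are given by the table later in this section), an agent-scope release store
paired with an agent-scope acquire load might look like::

  // Producer: agent-scope release store.
  buffer_wbl2 sc1                        // write back dirty L2 lines
  s_waitcnt vmcnt(0) lgkmcnt(0)          // prior accesses and the writeback have completed
  global_store_dword v[0:1], v2, off sc1 // store the flag

  // Consumer: agent-scope acquire load.
  global_load_dword v3, v[0:1], off sc1  // load the flag
  s_waitcnt vmcnt(0)                     // the flag load has completed
  buffer_inv sc1                         // invalidate stale non-local cache lines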
9445 Scalar memory operations are only used to access memory that is proven to not
9446 change during the execution of the kernel dispatch. This includes constant
9447 address space and global address space for program scope ``const`` variables.
9448 Therefore, the kernel machine code does not have to maintain the scalar cache to
9449 ensure it is coherent with the vector caches. The scalar and vector caches are
9450 invalidated between kernel dispatches by CP since constant address space data
9451 may change between kernel dispatch executions. See
9452 :ref:`amdgpu-amdhsa-memory-spaces`.
9454 The one exception is if scalar writes are used to spill SGPR registers. In this
9455 case the AMDGPU backend ensures the memory location used to spill is never
9456 accessed by vector memory operations at the same time. If scalar writes are used
9457 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
9458 return since the locations may be used for vector memory instructions by a
9459 future wavefront that uses the same scratch area, or a function call that
9460 creates a frame at the same address, respectively. There is no need for a
9461 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
9463 For kernarg backing memory:
9465 * CP invalidates the L1 cache at the start of each kernel dispatch.
9466 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
9467 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
9468 cache. This also causes it to be treated as non-volatile and so is not
9469 invalidated by ``*_vol``.
9470 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
9471 so the L2 cache will be coherent with the CPU and other agents.
9473 Scratch backing memory (which is used for the private address space) is accessed
9474 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
9475 only accessed by a single thread, and is always write-before-read, there is
9476 never a need to invalidate these entries from the L1 cache. Hence all cache
9477 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
9479 The code sequences used to implement the memory model for GFX940, GFX941, GFX942
9480 are defined in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table`.
9482 .. table:: AMDHSA Memory Model Code Sequences GFX940, GFX941, GFX942
9483 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table
9485 ============ ============ ============== ========== ================================
9486 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
9487 Ordering Sync Scope Address GFX940, GFX941, GFX942
9489 ============ ============ ============== ========== ================================
9490 **Non-Atomic**
9491 ------------------------------------------------------------------------------------
9492 load *none* *none* - global - !volatile & !nontemporal
9494 - private 1. buffer/global/flat_load
9496 - !volatile & nontemporal
9498 1. buffer/global/flat_load
9503 1. buffer/global/flat_load
9505 2. s_waitcnt vmcnt(0)
9507 - Must happen before
9508 any following volatile
9519 load *none* *none* - local 1. ds_load
9520 store *none* *none* - global - !volatile & !nontemporal
9522 - private 1. GFX940, GFX941
9523 - constant buffer/global/flat_store
9526 buffer/global/flat_store
9528 - !volatile & nontemporal
9531 buffer/global/flat_store
9534 buffer/global/flat_store
9539 1. buffer/global/flat_store
9541 2. s_waitcnt vmcnt(0)
9543 - Must happen before
9544 any following volatile
9555 store *none* *none* - local 1. ds_store
9556 **Unordered Atomic**
9557 ------------------------------------------------------------------------------------
9558 load atomic unordered *any* *any* *Same as non-atomic*.
9559 store atomic unordered *any* *any* *Same as non-atomic*.
9560 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
9561 **Monotonic Atomic**
9562 ------------------------------------------------------------------------------------
9563 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
9564 - wavefront - generic
9565 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
9567 load atomic monotonic - singlethread - local *If TgSplit execution mode,
9568 - wavefront local address space cannot
9569 - workgroup be used.*
9572 load atomic monotonic - agent - global 1. buffer/global/flat_load
9574 load atomic monotonic - system - global 1. buffer/global/flat_load
9575 - generic sc0=1 sc1=1
9576 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
9577 - wavefront - generic
9578 store atomic monotonic - workgroup - global 1. buffer/global/flat_store
9580 store atomic monotonic - agent - global 1. buffer/global/flat_store
9582 store atomic monotonic - system - global 1. buffer/global/flat_store
9583 - generic sc0=1 sc1=1
9584 store atomic monotonic - singlethread - local *If TgSplit execution mode,
9585 - wavefront local address space cannot
9586 - workgroup be used.*
9589 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
9590 - wavefront - generic
9593 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
9595 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
9596 - wavefront local address space cannot
9597 - workgroup be used.*
9600 **Acquire Atomic**
9601 ------------------------------------------------------------------------------------
9602 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
9605 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1
9606 2. s_waitcnt vmcnt(0)
9608 - If not TgSplit execution
9610 - Must happen before the
9611 following buffer_inv.
9615 - If not TgSplit execution
9617 - Must happen before
9628 load atomic acquire - workgroup - local *If TgSplit execution mode,
9629 local address space cannot
9633 2. s_waitcnt lgkmcnt(0)
9636 - Must happen before
9645 older than the local load
9649 load atomic acquire - workgroup - generic 1. flat_load sc0=1
9650 2. s_waitcnt lgkm/vmcnt(0)
9652 - Use lgkmcnt(0) if not
9653 TgSplit execution mode
9654 and vmcnt(0) if TgSplit
9656 - If OpenCL, omit lgkmcnt(0).
9657 - Must happen before
9660 following global/generic
9667 older than a local load
9673 - If not TgSplit execution
9680 load atomic acquire - agent - global 1. buffer/global_load
9682 2. s_waitcnt vmcnt(0)
9684 - Must happen before
9694 - Must happen before
9704 load atomic acquire - system - global 1. buffer/global/flat_load
9706 2. s_waitcnt vmcnt(0)
9708 - Must happen before
9716 3. buffer_inv sc0=1 sc1=1
9718 - Must happen before
9726 stale MTYPE NC global data.
9727 MTYPE RW and CC memory will
9728 never be stale due to the
9731 load atomic acquire - agent - generic 1. flat_load sc1=1
9732 2. s_waitcnt vmcnt(0) &
9735 - If TgSplit execution mode,
9739 - Must happen before
9742 - Ensures the flat_load
9749 - Must happen before
9759 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1
9760 2. s_waitcnt vmcnt(0) &
9763 - If TgSplit execution mode,
9767 - Must happen before
9770 - Ensures the flat_load
9775 3. buffer_inv sc0=1 sc1=1
9777 - Must happen before
9785 stale MTYPE NC global data.
9786 MTYPE RW and CC memory will
9787 never be stale due to the
9790 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
9791 - wavefront - generic
9792 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
9793 - wavefront local address space cannot
9797 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
9798 2. s_waitcnt vmcnt(0)
9800 - If not TgSplit execution
9802 - Must happen before the
9803 following buffer_inv.
9804 - Ensures the atomicrmw
9811 - If not TgSplit execution
9813 - Must happen before
9823 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
9824 local address space cannot
9828 2. s_waitcnt lgkmcnt(0)
9831 - Must happen before
9840 older than the local
9844 atomicrmw acquire - workgroup - generic 1. flat_atomic
9845 2. s_waitcnt lgkm/vmcnt(0)
9847 - Use lgkmcnt(0) if not
9848 TgSplit execution mode
9849 and vmcnt(0) if TgSplit
9851 - If OpenCL, omit lgkmcnt(0).
9852 - Must happen before
9869 - If not TgSplit execution
9876 atomicrmw acquire - agent - global 1. buffer/global_atomic
9877 2. s_waitcnt vmcnt(0)
9879 - Must happen before
9890 - Must happen before
9900 atomicrmw acquire - system - global 1. buffer/global_atomic
9902 2. s_waitcnt vmcnt(0)
9904 - Must happen before
9913 3. buffer_inv sc0=1 sc1=1
9915 - Must happen before
9923 stale MTYPE NC global data.
9924 MTYPE RW and CC memory will
9925 never be stale due to the
9928 atomicrmw acquire - agent - generic 1. flat_atomic
9929 2. s_waitcnt vmcnt(0) &
9932 - If TgSplit execution mode,
9936 - Must happen before
9947 - Must happen before
9957 atomicrmw acquire - system - generic 1. flat_atomic sc1=1
9958 2. s_waitcnt vmcnt(0) &
9961 - If TgSplit execution mode,
9965 - Must happen before
9974 3. buffer_inv sc0=1 sc1=1
9976 - Must happen before
9984 stale MTYPE NC global data.
9985 MTYPE RW and CC memory will
9986 never be stale due to the
9989 fence acquire - singlethread *none* *none*
9991 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
9993 - Use lgkmcnt(0) if not
9994 TgSplit execution mode
9995 and vmcnt(0) if TgSplit
10005 - However, since LLVM
10010 always generate. If
10020 - s_waitcnt vmcnt(0)
10023 global/generic load
10028 and memory ordering
10032 fence-paired-atomic).
10033 - s_waitcnt lgkmcnt(0)
10040 and memory ordering
10044 fence-paired-atomic).
10045 - Must happen before
10058 fence-paired-atomic.
10060 3. buffer_inv sc0=1
10062 - If not TgSplit execution
10069 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
10072 - If TgSplit execution mode,
10078 - However, since LLVM
10086 - Could be split into
10090 lgkmcnt(0) to allow
10092 independently moved
10095 - s_waitcnt vmcnt(0)
10098 global/generic load
10102 and memory ordering
10106 fence-paired-atomic).
10107 - s_waitcnt lgkmcnt(0)
10114 and memory ordering
10118 fence-paired-atomic).
10119 - Must happen before
10123 fence-paired atomic
10125 before invalidating
10129 locations read must
10133 fence-paired-atomic.
10135 2. buffer_inv sc1=1
10137 - Must happen before any
10138 following global/generic
10147 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
10150 - If TgSplit execution mode,
10156 - However, since LLVM
10164 - Could be split into
10168 lgkmcnt(0) to allow
10170 independently moved
10173 - s_waitcnt vmcnt(0)
10176 global/generic load
10180 and memory ordering
10184 fence-paired-atomic).
10185 - s_waitcnt lgkmcnt(0)
10192 and memory ordering
10196 fence-paired-atomic).
10197 - Must happen before
10201 fence-paired atomic
10203 before invalidating
10207 locations read must
10211 fence-paired-atomic.
10213 2. buffer_inv sc0=1 sc1=1
10215 - Must happen before any
10216 following global/generic
10225 **Release Atomic**
10226 ------------------------------------------------------------------------------------
10227 store atomic release - singlethread - global 1. GFX940, GFX941
10228 - wavefront - generic buffer/global/flat_store
10231 buffer/global/flat_store
10233 store atomic release - singlethread - local *If TgSplit execution mode,
10234 - wavefront local address space cannot
10238 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10240 - Use lgkmcnt(0) if not
10241 TgSplit execution mode
10242 and vmcnt(0) if TgSplit
10244 - If OpenCL, omit lgkmcnt(0).
10245 - s_waitcnt vmcnt(0)
10248 global/generic load/store/
10249 load atomic/store atomic/
10251 - s_waitcnt lgkmcnt(0)
10258 - Must happen before
10266 store that is being
10270 buffer/global/flat_store
10273 buffer/global/flat_store
10275 store atomic release - workgroup - local *If TgSplit execution mode,
10276 local address space cannot
10280 store atomic release - agent - global 1. buffer_wbl2 sc1=1
10282 - Must happen before
10283 following s_waitcnt.
10284 - Performs L2 writeback to
10287 store/atomicrmw are
10288 visible at agent scope.
10290 2. s_waitcnt lgkmcnt(0) &
10293 - If TgSplit execution mode,
10299 - Could be split into
10303 lgkmcnt(0) to allow
10305 independently moved
10308 - s_waitcnt vmcnt(0)
10315 - s_waitcnt lgkmcnt(0)
10322 - Must happen before
10330 store that is being
10334 buffer/global/flat_store
10337 buffer/global/flat_store
10339 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10341 - Must happen before
10342 following s_waitcnt.
10343 - Performs L2 writeback to
10346 store/atomicrmw are
10347 visible at system scope.
10349 2. s_waitcnt lgkmcnt(0) &
10352 - If TgSplit execution mode,
10358 - Could be split into
10362 lgkmcnt(0) to allow
10364 independently moved
10367 - s_waitcnt vmcnt(0)
10368 must happen after any
10374 - s_waitcnt lgkmcnt(0)
10375 must happen after any
10381 - Must happen before
10386 to memory and the L2
10390 store that is being
10393 3. buffer/global/flat_store
10395 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
10396 - wavefront - generic
10397 atomicrmw release - singlethread - local *If TgSplit execution mode,
10398 - wavefront local address space cannot
10402 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10404 - Use lgkmcnt(0) if not
10405 TgSplit execution mode
10406 and vmcnt(0) if TgSplit
10410 - s_waitcnt vmcnt(0)
10413 global/generic load/store/
10414 load atomic/store atomic/
10416 - s_waitcnt lgkmcnt(0)
10423 - Must happen before
10434 2. buffer/global/flat_atomic sc0=1
10435 atomicrmw release - workgroup - local *If TgSplit execution mode,
10436 local address space cannot
10440 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1
10442 - Must happen before
10443 following s_waitcnt.
10444 - Performs L2 writeback to
10447 store/atomicrmw are
10448 visible at agent scope.
10450 2. s_waitcnt lgkmcnt(0) &
10453 - If TgSplit execution mode,
10457 - Could be split into
10461 lgkmcnt(0) to allow
10463 independently moved
10466 - s_waitcnt vmcnt(0)
10473 - s_waitcnt lgkmcnt(0)
10480 - Must happen before
10485 to global and local
10491 3. buffer/global/flat_atomic sc1=1
10492 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10494 - Must happen before
10495 following s_waitcnt.
10496 - Performs L2 writeback to
10499 store/atomicrmw are
10500 visible at system scope.
10502 2. s_waitcnt lgkmcnt(0) &
10505 - If TgSplit execution mode,
10509 - Could be split into
10513 lgkmcnt(0) to allow
10515 independently moved
10518 - s_waitcnt vmcnt(0)
10525 - s_waitcnt lgkmcnt(0)
10532 - Must happen before
10537 to memory and the L2
10541 store that is being
10544 3. buffer/global/flat_atomic
10546 fence release - singlethread *none* *none*
10548 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10550 - Use lgkmcnt(0) if not
10551 TgSplit execution mode
10552 and vmcnt(0) if TgSplit
10562 - However, since LLVM
10567 always generate. If
10577 - s_waitcnt vmcnt(0)
10582 load atomic/store atomic/
10584 - s_waitcnt lgkmcnt(0)
10591 - Must happen before
10592 any following store
10596 and memory ordering
10600 fence-paired-atomic).
10607 fence-paired-atomic.
10609 fence release - agent *none* 1. buffer_wbl2 sc1=1
10614 - Must happen before
10615 following s_waitcnt.
10616 - Performs L2 writeback to
10619 store/atomicrmw are
10620 visible at agent scope.
10622 2. s_waitcnt lgkmcnt(0) &
10625 - If TgSplit execution mode,
10635 - However, since LLVM
10640 always generate. If
10650 - Could be split into
10654 lgkmcnt(0) to allow
10656 independently moved
10659 - s_waitcnt vmcnt(0)
10666 - s_waitcnt lgkmcnt(0)
10673 - Must happen before
10674 any following store
10678 and memory ordering
10682 fence-paired-atomic).
10689 fence-paired-atomic.
10691 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10693 - Must happen before
10694 following s_waitcnt.
10695 - Performs L2 writeback to
10698 store/atomicrmw are
10699 visible at system scope.
10701 2. s_waitcnt lgkmcnt(0) &
10704 - If TgSplit execution mode,
10714 - However, since LLVM
10719 always generate. If
10729 - Could be split into
10733 lgkmcnt(0) to allow
10735 independently moved
10738 - s_waitcnt vmcnt(0)
10745 - s_waitcnt lgkmcnt(0)
10752 - Must happen before
10753 any following store
10757 and memory ordering
10761 fence-paired-atomic).
10768 fence-paired-atomic.
10770 **Acquire-Release Atomic**
10771 ------------------------------------------------------------------------------------
10772 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
10773 - wavefront - generic
10774 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
10775 - wavefront local address space cannot
10779 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10781 - Use lgkmcnt(0) if not
10782 TgSplit execution mode
10783 and vmcnt(0) if TgSplit
10787 - Must happen after
10793 - s_waitcnt vmcnt(0)
10796 global/generic load/store/
10797 load atomic/store atomic/
10799 - s_waitcnt lgkmcnt(0)
10806 - Must happen before
10817 2. buffer/global_atomic
10818 3. s_waitcnt vmcnt(0)
10820 - If not TgSplit execution
10822 - Must happen before
10832 4. buffer_inv sc0=1
10834 - If not TgSplit execution
10841 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
10842 local address space cannot
10846 2. s_waitcnt lgkmcnt(0)
10849 - Must happen before
10858 older than the local load
10862 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
10864 - Use lgkmcnt(0) if not
10865 TgSplit execution mode
10866 and vmcnt(0) if TgSplit
10870 - s_waitcnt vmcnt(0)
10873 global/generic load/store/
10874 load atomic/store atomic/
10876 - s_waitcnt lgkmcnt(0)
10883 - Must happen before
10895 3. s_waitcnt lgkmcnt(0) &
10898 - If not TgSplit execution
10899 mode, omit vmcnt(0).
10902 - Must happen before
10913 older than a local load
10917 3. buffer_inv sc0=1
10919 - If not TgSplit execution
10926 atomicrmw acq_rel - agent - global 1. buffer_wbl2 sc1=1
10928 - Must happen before
10929 following s_waitcnt.
10930 - Performs L2 writeback to
10933 store/atomicrmw are
10934 visible at agent scope.
10936 2. s_waitcnt lgkmcnt(0) &
10939 - If TgSplit execution mode,
10943 - Could be split into
10947 lgkmcnt(0) to allow
10949 independently moved
10952 - s_waitcnt vmcnt(0)
10959 - s_waitcnt lgkmcnt(0)
10966 - Must happen before
10977 3. buffer/global_atomic
10978 4. s_waitcnt vmcnt(0)
10980 - Must happen before
10989 5. buffer_inv sc1=1
10991 - Must happen before
11001 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1
11003 - Must happen before
11004 following s_waitcnt.
11005 - Performs L2 writeback to
11008 store/atomicrmw are
11009 visible at system scope.
11011 2. s_waitcnt lgkmcnt(0) &
11014 - If TgSplit execution mode,
11018 - Could be split into
11022 lgkmcnt(0) to allow
11024 independently moved
11027 - s_waitcnt vmcnt(0)
11034 - s_waitcnt lgkmcnt(0)
11041 - Must happen before
11046 to global and L2 writeback
11047 have completed before
11052 3. buffer/global_atomic
11054 4. s_waitcnt vmcnt(0)
11056 - Must happen before
11065 5. buffer_inv sc0=1 sc1=1
11067 - Must happen before
11075 MTYPE NC global data.
11076 MTYPE RW and CC memory will
11077 never be stale due to the
11080 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1
11082 - Must happen before
11083 following s_waitcnt.
11084 - Performs L2 writeback to
11087 store/atomicrmw are
11088 visible at agent scope.
11090 2. s_waitcnt lgkmcnt(0) &
11093 - If TgSplit execution mode,
11097 - Could be split into
11101 lgkmcnt(0) to allow
11103 independently moved
11106 - s_waitcnt vmcnt(0)
11113 - s_waitcnt lgkmcnt(0)
11120 - Must happen before
11132 4. s_waitcnt vmcnt(0) &
11135 - If TgSplit execution mode,
11139 - Must happen before
11148 5. buffer_inv sc1=1
11150 - Must happen before
11160 atomicrmw acq_rel - system - generic 1. buffer_wbl2 sc0=1 sc1=1
11162 - Must happen before
11163 following s_waitcnt.
11164 - Performs L2 writeback to
11167 store/atomicrmw are
11168 visible at system scope.
11170 2. s_waitcnt lgkmcnt(0) &
11173 - If TgSplit execution mode,
11177 - Could be split into
11181 lgkmcnt(0) to allow
11183 independently moved
11186 - s_waitcnt vmcnt(0)
11193 - s_waitcnt lgkmcnt(0)
11200 - Must happen before
11205 to global and L2 writeback
11206 have completed before
11211 3. flat_atomic sc1=1
11212 4. s_waitcnt vmcnt(0) &
11215 - If TgSplit execution mode,
11219 - Must happen before
11228 5. buffer_inv sc0=1 sc1=1
11230 - Must happen before
11238 MTYPE NC global data.
11239 MTYPE RW and CC memory will
11240 never be stale due to the
11243 fence acq_rel - singlethread *none* *none*
11245 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
11247 - Use lgkmcnt(0) if not
11248 TgSplit execution mode
11249 and vmcnt(0) if TgSplit
11268 - s_waitcnt vmcnt(0)
11273 load atomic/store atomic/
11275 - s_waitcnt lgkmcnt(0)
11282 - Must happen before
11301 and memory ordering
11305 acquire-fence-paired-atomic)
11318 local/generic store
11322 and memory ordering
11326 release-fence-paired-atomic).
11330 - Must happen before
11334 acquire-fence-paired
11335 atomic has completed
11336 before invalidating
11340 locations read must
11344 acquire-fence-paired-atomic.
11346 3. buffer_inv sc0=1
11348 - If not TgSplit execution
11355 fence acq_rel - agent *none* 1. buffer_wbl2 sc1=1
11360 - Must happen before
11361 following s_waitcnt.
11362 - Performs L2 writeback to
11365 store/atomicrmw are
11366 visible at agent scope.
11368 2. s_waitcnt lgkmcnt(0) &
11371 - If TgSplit execution mode,
11377 - However, since LLVM
11385 - Could be split into
11389 lgkmcnt(0) to allow
11391 independently moved
11394 - s_waitcnt vmcnt(0)
11401 - s_waitcnt lgkmcnt(0)
11408 - Must happen before
11413 global/local/generic
11418 and memory ordering
11422 acquire-fence-paired-atomic)
11424 before invalidating
11434 global/local/generic
11439 and memory ordering
11443 release-fence-paired-atomic).
11448 3. buffer_inv sc1=1
11450 - Must happen before
11464 fence acq_rel - system *none* 1. buffer_wbl2 sc0=1 sc1=1
11469 - Must happen before
11470 following s_waitcnt.
11471 - Performs L2 writeback to
11474 store/atomicrmw are
11475 visible at system scope.
11477 1. s_waitcnt lgkmcnt(0) &
11480 - If TgSplit execution mode,
11486 - However, since LLVM
11494 - Could be split into
11498 lgkmcnt(0) to allow
11500 independently moved
11503 - s_waitcnt vmcnt(0)
11510 - s_waitcnt lgkmcnt(0)
11517 - Must happen before
11522 global/local/generic
11527 and memory ordering
11531 acquire-fence-paired-atomic)
11533 before invalidating
11543 global/local/generic
11548 and memory ordering
11552 release-fence-paired-atomic).
11557 2. buffer_inv sc0=1 sc1=1
11559 - Must happen before
11568 MTYPE NC global data.
11569 MTYPE RW and CC memory will
11570 never be stale due to the
11573 **Sequential Consistent Atomic**
11574 ------------------------------------------------------------------------------------
11575 load atomic seq_cst - singlethread - global *Same as corresponding
11576 - wavefront - local load atomic acquire,
11577 - generic except must generate
11578 all instructions even
11580 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
11582 - Use lgkmcnt(0) if not
11583 TgSplit execution mode
11584 and vmcnt(0) if TgSplit
11586 - s_waitcnt lgkmcnt(0) must
11593 ordering of seq_cst
11599 lgkmcnt(0) and so do
11602 - s_waitcnt vmcnt(0)
11605 global/generic load
11609 ordering of seq_cst
11621 consistent global/local
11622 memory instructions
11628 prevents reordering
11631 seq_cst load. (Note
11637 followed by a store
11644 release followed by
11647 order. The s_waitcnt
11648 could be placed after
11649 seq_store or before
11652 make the s_waitcnt be
11653 as late as possible
11659 instructions same as
11662 except must generate
11663 all instructions even
11665 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
11666 local address space cannot
11669 *Same as corresponding
11670 load atomic acquire,
11671 except must generate
11672 all instructions even
11675 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
11676 - system - generic vmcnt(0)
11678 - If TgSplit execution mode,
11680 - Could be split into
11684 lgkmcnt(0) to allow
11686 independently moved
11689 - s_waitcnt lgkmcnt(0)
11692 global/generic load
11696 ordering of seq_cst
11702 lgkmcnt(0) and so do
11705 - s_waitcnt vmcnt(0)
11708 global/generic load
11712 ordering of seq_cst
11725 memory instructions
11731 prevents reordering
11734 seq_cst load. (Note
11740 followed by a store
11747 release followed by
11750 order. The s_waitcnt
11751 could be placed after
11752 seq_store or before
11755 make the s_waitcnt be
11756 as late as possible
11762 instructions same as
11765 except must generate
11766 all instructions even
11768 store atomic seq_cst - singlethread - global *Same as corresponding
11769 - wavefront - local store atomic release,
11770 - workgroup - generic except must generate
11771 - agent all instructions even
11772 - system for OpenCL.*
11773 atomicrmw seq_cst - singlethread - global *Same as corresponding
11774 - wavefront - local atomicrmw acq_rel,
11775 - workgroup - generic except must generate
11776 - agent all instructions even
11777 - system for OpenCL.*
11778 fence seq_cst - singlethread *none* *Same as corresponding
11779 - wavefront fence acq_rel,
11780 - workgroup except must generate
11781 - agent all instructions even
11782 - system for OpenCL.*
11783 ============ ============ ============== ========== ================================
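For example, reading the table above, an ``acq_rel`` ``atomicrmw`` at ``system``
scope on the global address space follows this overall shape (a sketch only;
the particular atomic operation, registers, and cache-policy spellings shown
are placeholders)::

  buffer_wbl2 sc0 sc1                        // 1. write back the L2 so prior writes are
                                             //    visible at system scope
  s_waitcnt vmcnt(0) lgkmcnt(0)              // 2. prior accesses and the writeback complete
  global_atomic_swap v3, v[0:1], v2, off sc0 // 3. the read-modify-write itself
  s_waitcnt vmcnt(0)                         // 4. the atomic has completed
  buffer_inv sc0 sc1                         // 5. invalidate stale lines before any
                                             //    following reads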
11785 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11787 Memory Model GFX10-GFX11
11788 ++++++++++++++++++++++++
11790 For GFX10-GFX11:
11792 * Each agent has multiple shader arrays (SA).
11793 * Each SA has multiple work-group processors (WGP).
11794 * Each WGP has multiple compute units (CU).
11795 * Each CU has multiple SIMDs that execute wavefronts.
11796 * The wavefronts for a single work-group are executed in the same
11797 WGP. In CU wavefront execution mode the wavefronts may be executed by
11798 different SIMDs in the same CU. In WGP wavefront execution mode the
11799 wavefronts may be executed by different SIMDs in different CUs in the same
11800 WGP.
11801 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
11802 executing on it.
11803 * All LDS operations of a WGP are performed as wavefront wide operations in a
11804 global order and involve no caching. Completion is reported to a wavefront in
11805 execution order.
11806 * The LDS memory has multiple request queues shared by the SIMDs of a
11807 WGP. Therefore, the LDS operations performed by different wavefronts of a
11808 work-group can be reordered relative to each other, which can result in
11809 reordering the visibility of vector memory operations with respect to LDS
11810 operations of other wavefronts in the same work-group. A ``s_waitcnt
11811 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11812 vector memory operations between wavefronts of a work-group, but not between
11813 operations performed by the same wavefront.
11814 * The vector memory operations are performed as wavefront wide operations.
11815 Completion of load/store/sample operations are reported to a wavefront in
11816 execution order of other load/store/sample operations performed by that
11817 wavefront.
11818 * The vector memory operations access a vector L0 cache. There is a single L0
11819 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11820 special action is required for coherence between the lanes of a single
11821 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11822 wavefronts executing in the same work-group as they may be executing on SIMDs
11823 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11824 required for coherence between wavefronts executing in different work-groups
11825 as they may be executing on different WGPs.
11826 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
11827 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11828 operations are used in a restricted way so do not impact the memory model. See
11829 :ref:`amdgpu-amdhsa-memory-spaces`.
11830 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11831 the same SA. Therefore, no special action is required for coherence between
11832 the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11833 required for coherence between wavefronts executing in different work-groups
11834 as they may be executing on different SAs that access different L1s.
11835 * The L1 caches have independent quadrants to service disjoint ranges of virtual
11836 addresses.
11837 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11838 vector and scalar memory operations performed by different wavefronts, whether
11839 executing in the same or different work-groups (which may be executing on
11840 different CUs accessing different L0s), can be reordered relative to each
11841 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11842 synchronization between vector memory operations of different wavefronts. It
11843 ensures a previous vector memory operation has completed before executing a
11844 subsequent vector memory or LDS operation and so can be used to meet the
11845 requirements of acquire, release and sequential consistency.
11846 * The L1 caches use an L2 cache shared by all SAs on the same agent.
11847 * The L2 cache has independent channels to service disjoint ranges of virtual
11848 addresses.
11849 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11850 quadrant has a separate request queue per L2 channel. Therefore, the vector
11851 and scalar memory operations performed by wavefronts executing in different
11852 work-groups (which may be executing on different SAs) of an agent can be
11853 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11854 required to ensure synchronization between vector memory operations of
11855 different SAs. It ensures a previous vector memory operation has completed
11856 before executing a subsequent vector memory and so can be used to meet the
11857 requirements of acquire, release and sequential consistency.
11858 * The L2 cache can be kept coherent with other agents on some targets, or ranges
11859 of virtual addresses can be set up to bypass it to ensure system coherence.
11860 * On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
11861 The MALL cache is fully coherent with GPU memory and has no impact on system
11862 coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
11864 Scalar memory operations are only used to access memory that is proven to not
11865 change during the execution of the kernel dispatch. This includes constant
11866 address space and global address space for program scope ``const`` variables.
11867 Therefore, the kernel machine code does not have to maintain the scalar cache to
11868 ensure it is coherent with the vector caches. The scalar and vector caches are
11869 invalidated between kernel dispatches by CP since constant address space data
11870 may change between kernel dispatch executions. See
11871 :ref:`amdgpu-amdhsa-memory-spaces`.
11873 The one exception is if scalar writes are used to spill SGPR registers. In this
11874 case the AMDGPU backend ensures the memory location used to spill is never
11875 accessed by vector memory operations at the same time. If scalar writes are used
11876 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11877 return since the locations may be used for vector memory instructions by a
11878 future wavefront that uses the same scratch area, or a function call that
11879 creates a frame at the same address, respectively. There is no need for a
11880 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
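As a sketch only (the backend chooses the exact placement), a kernel epilogue
that used scalar stores for SGPR spills ends with a write-back of the scalar
cache::

  s_dcache_wb    ; make scalar-written spill locations visible to vector memory
  s_endpgm       ; end of the kernel dispatch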
11882 For kernarg backing memory:
11884 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11885 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11886 needing to invalidate the L2 cache.
11887 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11888 so the L2 cache will be coherent with the CPU and other agents.
11890 Scratch backing memory (which is used for the private address space) is accessed
11891 with MTYPE NC (non-coherent). Since the private address space is only accessed
11892 by a single thread, and is always write-before-read, there is never a need to
11893 invalidate these entries from the L0 or L1 caches.
11895 Wavefronts are executed in native mode with in-order reporting of loads and
11896 sample instructions. In this mode vmcnt reports completion of load, atomic with
11897 return and sample instructions in order, and the vscnt reports the completion of
11898 store and atomic without return in order. See ``MEM_ORDERED`` field in
11899 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
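For example (a sketch only; the registers are illustrative), on GFX10-GFX11 a
load and a store are tracked by different counters and waited on separately::

  global_load_dword  v1, v[2:3], off    ; completion tracked by vmcnt
  global_store_dword v[4:5], v6, off    ; completion tracked by vscnt
  s_waitcnt vmcnt(0)                    ; wait for the load to complete
  s_waitcnt_vscnt null, 0x0             ; wait for the store to complete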
11901 Wavefronts can be executed in WGP or CU wavefront execution mode:
11903 * In WGP wavefront execution mode the wavefronts of a work-group are executed
11904 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
11905 CU L0 caches is required for work-group synchronization. Also accesses to L1
11906 at work-group scope need to be explicitly ordered as the accesses from
11907 different CUs are not ordered.
11908 * In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group access the same L0, which in turn ensures L1 accesses are
11911 ordered and so do not require explicit management of the caches for
11912 work-group synchronization.
11914 See ``WGP_MODE`` field in
11915 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
11916 :ref:`amdgpu-target-features`.
11918 The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11919 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
11921 .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11922 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11924 ============ ============ ============== ========== ================================
11925 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
11926 Ordering Sync Scope Address GFX10-GFX11
11928 ============ ============ ============== ========== ================================
11930 ------------------------------------------------------------------------------------
11931 load *none* *none* - global - !volatile & !nontemporal
11933 - private 1. buffer/global/flat_load
11935 - !volatile & nontemporal
11937 1. buffer/global/flat_load
11940 - If GFX10, omit dlc=1.
11944 1. buffer/global/flat_load
11947 2. s_waitcnt vmcnt(0)
11949 - Must happen before
11950 any following volatile
11961 load *none* *none* - local 1. ds_load
11962 store *none* *none* - global - !volatile & !nontemporal
11964 - private 1. buffer/global/flat_store
11966 - !volatile & nontemporal
11968 1. buffer/global/flat_store
11971 - If GFX10, omit dlc=1.
11975 1. buffer/global/flat_store
11978 - If GFX10, omit dlc=1.
11980 2. s_waitcnt vscnt(0)
11982 - Must happen before
11983 any following volatile
11994 store *none* *none* - local 1. ds_store
11995 **Unordered Atomic**
11996 ------------------------------------------------------------------------------------
11997 load atomic unordered *any* *any* *Same as non-atomic*.
11998 store atomic unordered *any* *any* *Same as non-atomic*.
11999 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
12000 **Monotonic Atomic**
12001 ------------------------------------------------------------------------------------
12002 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
12003 - wavefront - generic
12004 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
12007 - If CU wavefront execution
12010 load atomic monotonic - singlethread - local 1. ds_load
12013 load atomic monotonic - agent - global 1. buffer/global/flat_load
12014 - system - generic glc=1 dlc=1
12016 - If GFX11, omit dlc=1.
12018 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
12019 - wavefront - generic
12023 store atomic monotonic - singlethread - local 1. ds_store
12026 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
12027 - wavefront - generic
12031 atomicrmw monotonic - singlethread - local 1. ds_atomic
12035 ------------------------------------------------------------------------------------
12036 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
12037 - wavefront - local
12039 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
12041 - If CU wavefront execution
12044 2. s_waitcnt vmcnt(0)
12046 - If CU wavefront execution
12048 - Must happen before
12049 the following buffer_gl0_inv
12050 and before any following
12058 - If CU wavefront execution
12065 load atomic acquire - workgroup - local 1. ds_load
12066 2. s_waitcnt lgkmcnt(0)
12069 - Must happen before
12070 the following buffer_gl0_inv
12071 and before any following
12072 global/generic load/load
12078 older than the local load
12084 - If CU wavefront execution
12092 load atomic acquire - workgroup - generic 1. flat_load glc=1
12094 - If CU wavefront execution
12097 2. s_waitcnt lgkmcnt(0) &
12100 - If CU wavefront execution
12101 mode, omit vmcnt(0).
12104 - Must happen before
12106 buffer_gl0_inv and any
12107 following global/generic
12114 older than a local load
12120 - If CU wavefront execution
12127 load atomic acquire - agent - global 1. buffer/global_load
12128 - system glc=1 dlc=1
12130 - If GFX11, omit dlc=1.
12132 2. s_waitcnt vmcnt(0)
12134 - Must happen before
12139 before invalidating
12145 - Must happen before
12155 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
12157 - If GFX11, omit dlc=1.
12159 2. s_waitcnt vmcnt(0) &
12164 - Must happen before
12167 - Ensures the flat_load
12169 before invalidating
12175 - Must happen before
12185 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
12186 - wavefront - local
12188 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
12189 2. s_waitcnt vm/vscnt(0)
12191 - If CU wavefront execution
12193 - Use vmcnt(0) if atomic with
12194 return and vscnt(0) if
12195 atomic with no-return.
12196 - Must happen before
12197 the following buffer_gl0_inv
12198 and before any following
12206 - If CU wavefront execution
12213 atomicrmw acquire - workgroup - local 1. ds_atomic
12214 2. s_waitcnt lgkmcnt(0)
12217 - Must happen before
12223 older than the local
12235 atomicrmw acquire - workgroup - generic 1. flat_atomic
12236 2. s_waitcnt lgkmcnt(0) &
12239 - If CU wavefront execution
12240 mode, omit vm/vscnt(0).
12241 - If OpenCL, omit lgkmcnt(0).
12242 - Use vmcnt(0) if atomic with
12243 return and vscnt(0) if
12244 atomic with no-return.
12245 - Must happen before
12257 - If CU wavefront execution
12264 atomicrmw acquire - agent - global 1. buffer/global_atomic
12265 - system 2. s_waitcnt vm/vscnt(0)
12267 - Use vmcnt(0) if atomic with
12268 return and vscnt(0) if
12269 atomic with no-return.
12270 - Must happen before
12282 - Must happen before
12292 atomicrmw acquire - agent - generic 1. flat_atomic
12293 - system 2. s_waitcnt vm/vscnt(0) &
12298 - Use vmcnt(0) if atomic with
12299 return and vscnt(0) if
12300 atomic with no-return.
12301 - Must happen before
12313 - Must happen before
12323 fence acquire - singlethread *none* *none*
12325 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12326 vmcnt(0) & vscnt(0)
12328 - If CU wavefront execution
12329 mode, omit vmcnt(0) and
12338 vmcnt(0) and vscnt(0).
12339 - However, since LLVM
12344 always generate. If
12354 - Could be split into
12356 vmcnt(0), s_waitcnt
12357 vscnt(0) and s_waitcnt
12358 lgkmcnt(0) to allow
12360 independently moved
12363 - s_waitcnt vmcnt(0)
12366 global/generic load
12368 atomicrmw-with-return-value
12371 and memory ordering
12375 fence-paired-atomic).
12376 - s_waitcnt vscnt(0)
12380 atomicrmw-no-return-value
12383 and memory ordering
12387 fence-paired-atomic).
12388 - s_waitcnt lgkmcnt(0)
12395 and memory ordering
12399 fence-paired-atomic).
12400 - Must happen before
12404 fence-paired atomic
12406 before invalidating
12410 locations read must
12414 fence-paired-atomic.
12418 - If CU wavefront execution
12425 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
12426 - system vmcnt(0) & vscnt(0)
12435 vmcnt(0) and vscnt(0).
12436 - However, since LLVM
12444 - Could be split into
12446 vmcnt(0), s_waitcnt
12447 vscnt(0) and s_waitcnt
12448 lgkmcnt(0) to allow
12450 independently moved
12453 - s_waitcnt vmcnt(0)
12456 global/generic load
12458 atomicrmw-with-return-value
12461 and memory ordering
12465 fence-paired-atomic).
12466 - s_waitcnt vscnt(0)
12470 atomicrmw-no-return-value
12473 and memory ordering
12477 fence-paired-atomic).
12478 - s_waitcnt lgkmcnt(0)
12485 and memory ordering
12489 fence-paired-atomic).
12490 - Must happen before
12494 fence-paired atomic
12496 before invalidating
12500 locations read must
12504 fence-paired-atomic.
12509 - Must happen before any
12510 following global/generic
12520 ------------------------------------------------------------------------------------
12521 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
12522 - wavefront - local
12524 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12525 - generic vmcnt(0) & vscnt(0)
12527 - If CU wavefront execution
12528 mode, omit vmcnt(0) and
12532 - Could be split into
12534 vmcnt(0), s_waitcnt
12535 vscnt(0) and s_waitcnt
12536 lgkmcnt(0) to allow
12538 independently moved
12541 - s_waitcnt vmcnt(0)
12544 global/generic load/load
12546 atomicrmw-with-return-value.
12547 - s_waitcnt vscnt(0)
12553 atomicrmw-no-return-value.
12554 - s_waitcnt lgkmcnt(0)
12561 - Must happen before
12569 store that is being
12572 2. buffer/global/flat_store
12573 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12575 - If CU wavefront execution
12578 - Could be split into
12580 vmcnt(0) and s_waitcnt
12583 independently moved
12586 - s_waitcnt vmcnt(0)
12589 global/generic load/load
12591 atomicrmw-with-return-value.
12592 - s_waitcnt vscnt(0)
12596 store/store atomic/
12597 atomicrmw-no-return-value.
12598 - Must happen before
12606 store that is being
12610 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
12611 - system - generic vmcnt(0) & vscnt(0)
12617 - Could be split into
12619 vmcnt(0), s_waitcnt vscnt(0)
12621 lgkmcnt(0) to allow
12623 independently moved
12626 - s_waitcnt vmcnt(0)
12632 atomicrmw-with-return-value.
12633 - s_waitcnt vscnt(0)
12637 store/store atomic/
12638 atomicrmw-no-return-value.
12639 - s_waitcnt lgkmcnt(0)
12646 - Must happen before
12654 store that is being
12657 2. buffer/global/flat_store
12658 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
12659 - wavefront - local
12661 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12662 - generic vmcnt(0) & vscnt(0)
12664 - If CU wavefront execution
12665 mode, omit vmcnt(0) and
12667 - If OpenCL, omit lgkmcnt(0).
12668 - Could be split into
12670 vmcnt(0), s_waitcnt
12671 vscnt(0) and s_waitcnt
12672 lgkmcnt(0) to allow
12674 independently moved
12677 - s_waitcnt vmcnt(0)
12680 global/generic load/load
12682 atomicrmw-with-return-value.
12683 - s_waitcnt vscnt(0)
12689 atomicrmw-no-return-value.
12690 - s_waitcnt lgkmcnt(0)
12697 - Must happen before
12708 2. buffer/global/flat_atomic
12709 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12711 - If CU wavefront execution
12714 - Could be split into
12716 vmcnt(0) and s_waitcnt
12719 independently moved
12722 - s_waitcnt vmcnt(0)
12725 global/generic load/load
12727 atomicrmw-with-return-value.
12728 - s_waitcnt vscnt(0)
12732 store/store atomic/
12733 atomicrmw-no-return-value.
12734 - Must happen before
12742 store that is being
12746 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
12747 - system - generic vmcnt(0) & vscnt(0)
12751 - Could be split into
12753 vmcnt(0), s_waitcnt
12754 vscnt(0) and s_waitcnt
12755 lgkmcnt(0) to allow
12757 independently moved
12760 - s_waitcnt vmcnt(0)
12765 atomicrmw-with-return-value.
12766 - s_waitcnt vscnt(0)
12770 store/store atomic/
12771 atomicrmw-no-return-value.
12772 - s_waitcnt lgkmcnt(0)
12779 - Must happen before
12784 to global and local
12790 2. buffer/global/flat_atomic
12791 fence release - singlethread *none* *none*
12793 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12794 vmcnt(0) & vscnt(0)
12796 - If CU wavefront execution
12797 mode, omit vmcnt(0) and
12806 vmcnt(0) and vscnt(0).
12807 - However, since LLVM
12812 always generate. If
12822 - Could be split into
12824 vmcnt(0), s_waitcnt
12825 vscnt(0) and s_waitcnt
12826 lgkmcnt(0) to allow
12828 independently moved
12831 - s_waitcnt vmcnt(0)
12837 atomicrmw-with-return-value.
12838 - s_waitcnt vscnt(0)
12842 store/store atomic/
12843 atomicrmw-no-return-value.
12844 - s_waitcnt lgkmcnt(0)
12849 atomic/store atomic/
12851 - Must happen before
12852 any following store
12856 and memory ordering
12860 fence-paired-atomic).
12867 fence-paired-atomic.
12869 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
12870 - system vmcnt(0) & vscnt(0)
12879 vmcnt(0) and vscnt(0).
12880 - However, since LLVM
12885 always generate. If
12895 - Could be split into
12897 vmcnt(0), s_waitcnt
12898 vscnt(0) and s_waitcnt
12899 lgkmcnt(0) to allow
12901 independently moved
12904 - s_waitcnt vmcnt(0)
12909 atomicrmw-with-return-value.
12910 - s_waitcnt vscnt(0)
12914 store/store atomic/
12915 atomicrmw-no-return-value.
12916 - s_waitcnt lgkmcnt(0)
12923 - Must happen before
12924 any following store
12928 and memory ordering
12932 fence-paired-atomic).
12939 fence-paired-atomic.
12941 **Acquire-Release Atomic**
12942 ------------------------------------------------------------------------------------
12943 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
12944 - wavefront - local
12946 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12947 vmcnt(0) & vscnt(0)
12949 - If CU wavefront execution
12950 mode, omit vmcnt(0) and
12954 - Must happen after
12960 - Could be split into
12962 vmcnt(0), s_waitcnt
12963 vscnt(0), and s_waitcnt
12964 lgkmcnt(0) to allow
12966 independently moved
12969 - s_waitcnt vmcnt(0)
12972 global/generic load/load
12974 atomicrmw-with-return-value.
12975 - s_waitcnt vscnt(0)
12981 atomicrmw-no-return-value.
12982 - s_waitcnt lgkmcnt(0)
12989 - Must happen before
13000 2. buffer/global_atomic
13001 3. s_waitcnt vm/vscnt(0)
13003 - If CU wavefront execution
13005 - Use vmcnt(0) if atomic with
13006 return and vscnt(0) if
13007 atomic with no-return.
13008 - Must happen before
13020 - If CU wavefront execution
13027 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
13029 - If CU wavefront execution
13032 - Could be split into
13034 vmcnt(0) and s_waitcnt
13037 independently moved
13040 - s_waitcnt vmcnt(0)
13043 global/generic load/load
13045 atomicrmw-with-return-value.
13046 - s_waitcnt vscnt(0)
13050 store/store atomic/
13051 atomicrmw-no-return-value.
13052 - Must happen before
13060 store that is being
13064 3. s_waitcnt lgkmcnt(0)
13067 - Must happen before
13073 older than the local load
13079 - If CU wavefront execution
13087 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
13088 vmcnt(0) & vscnt(0)
13090 - If CU wavefront execution
13091 mode, omit vmcnt(0) and
13093 - If OpenCL, omit lgkmcnt(0).
13094 - Could be split into
13096 vmcnt(0), s_waitcnt
13097 vscnt(0) and s_waitcnt
13098 lgkmcnt(0) to allow
13100 independently moved
13103 - s_waitcnt vmcnt(0)
13106 global/generic load/load
13108 atomicrmw-with-return-value.
13109 - s_waitcnt vscnt(0)
13115 atomicrmw-no-return-value.
13116 - s_waitcnt lgkmcnt(0)
13123 - Must happen before
13135 3. s_waitcnt lgkmcnt(0) &
13136 vmcnt(0) & vscnt(0)
13138 - If CU wavefront execution
13139 mode, omit vmcnt(0) and
13141 - If OpenCL, omit lgkmcnt(0).
13142 - Must happen before
13148 older than the load
13154 - If CU wavefront execution
13161 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
13162 - system vmcnt(0) & vscnt(0)
13166 - Could be split into
13168 vmcnt(0), s_waitcnt
13169 vscnt(0) and s_waitcnt
13170 lgkmcnt(0) to allow
13172 independently moved
13175 - s_waitcnt vmcnt(0)
13180 atomicrmw-with-return-value.
13181 - s_waitcnt vscnt(0)
13185 store/store atomic/
13186 atomicrmw-no-return-value.
13187 - s_waitcnt lgkmcnt(0)
13194 - Must happen before
13205 2. buffer/global_atomic
13206 3. s_waitcnt vm/vscnt(0)
13208 - Use vmcnt(0) if atomic with
13209 return and vscnt(0) if
13210 atomic with no-return.
13211 - Must happen before
13223 - Must happen before
13233 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
13234 - system vmcnt(0) & vscnt(0)
13238 - Could be split into
13240 vmcnt(0), s_waitcnt
13241 vscnt(0), and s_waitcnt
13242 lgkmcnt(0) to allow
13244 independently moved
13247 - s_waitcnt vmcnt(0)
13252 atomicrmw-with-return-value.
13253 - s_waitcnt vscnt(0)
13257 store/store atomic/
13258 atomicrmw-no-return-value.
13259 - s_waitcnt lgkmcnt(0)
13266 - Must happen before
13278 3. s_waitcnt vm/vscnt(0) &
13283 - Use vmcnt(0) if atomic with
13284 return and vscnt(0) if
13285 atomic with no-return.
13286 - Must happen before
13298 - Must happen before
13308 fence acq_rel - singlethread *none* *none*
13310 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
13311 vmcnt(0) & vscnt(0)
13313 - If CU wavefront execution
13314 mode, omit vmcnt(0) and
13323 vmcnt(0) and vscnt(0).
13333 - Could be split into
13335 vmcnt(0), s_waitcnt
13336 vscnt(0) and s_waitcnt
13337 lgkmcnt(0) to allow
13339 independently moved
13342 - s_waitcnt vmcnt(0)
13348 atomicrmw-with-return-value.
13349 - s_waitcnt vscnt(0)
13353 store/store atomic/
13354 atomicrmw-no-return-value.
13355 - s_waitcnt lgkmcnt(0)
13360 atomic/store atomic/
13362 - Must happen before
13381 and memory ordering
13385 acquire-fence-paired-atomic)
13398 local/generic store
13402 and memory ordering
13406 release-fence-paired-atomic).
13410 - Must happen before
13414 acquire-fence-paired
13415 atomic has completed
13416 before invalidating
13420 locations read must
13424 acquire-fence-paired-atomic.
13428 - If CU wavefront execution
13435 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
13436 - system vmcnt(0) & vscnt(0)
13445 vmcnt(0) and vscnt(0).
13446 - However, since LLVM
13454 - Could be split into
13456 vmcnt(0), s_waitcnt
13457 vscnt(0) and s_waitcnt
13458 lgkmcnt(0) to allow
13460 independently moved
13463 - s_waitcnt vmcnt(0)
13469 atomicrmw-with-return-value.
13470 - s_waitcnt vscnt(0)
13474 store/store atomic/
13475 atomicrmw-no-return-value.
13476 - s_waitcnt lgkmcnt(0)
13483 - Must happen before
13488 global/local/generic
13493 and memory ordering
13497 acquire-fence-paired-atomic)
13499 before invalidating
13509 global/local/generic
13514 and memory ordering
13518 release-fence-paired-atomic).
13526 - Must happen before
13540 **Sequential Consistent Atomic**
13541 ------------------------------------------------------------------------------------
13542 load atomic seq_cst - singlethread - global *Same as corresponding
13543 - wavefront - local load atomic acquire,
13544 - generic except must generate
13545 all instructions even
13547 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
13548 - generic vmcnt(0) & vscnt(0)
13550 - If CU wavefront execution
13551 mode, omit vmcnt(0) and
13553 - Could be split into
13555 vmcnt(0), s_waitcnt
13556 vscnt(0), and s_waitcnt
13557 lgkmcnt(0) to allow
13559 independently moved
13562 - s_waitcnt lgkmcnt(0) must
13569 ordering of seq_cst
13575 lgkmcnt(0) and so do
13578 - s_waitcnt vmcnt(0)
13581 global/generic load
13583 atomicrmw-with-return-value
13585 ordering of seq_cst
13594 - s_waitcnt vscnt(0)
13597 global/generic store
13599 atomicrmw-no-return-value
13601 ordering of seq_cst
13613 consistent global/local
13614 memory instructions
13620 prevents reordering
13623 seq_cst load. (Note
13629 followed by a store
13636 release followed by
13639 order. The s_waitcnt
13640 could be placed after
13641 seq_store or before
13644 make the s_waitcnt be
13645 as late as possible
13651 instructions same as
13654 except must generate
13655 all instructions even
13657 load atomic seq_cst - workgroup - local
13659 1. s_waitcnt vmcnt(0) & vscnt(0)
13661 - If CU wavefront execution
13663 - Could be split into
13665 vmcnt(0) and s_waitcnt
13668 independently moved
13671 - s_waitcnt vmcnt(0)
13674 global/generic load
13676 atomicrmw-with-return-value
13678 ordering of seq_cst
13687 - s_waitcnt vscnt(0)
13690 global/generic store
13692 atomicrmw-no-return-value
13694 ordering of seq_cst
13707 memory instructions
13713 prevents reordering
13716 seq_cst load. (Note
13722 followed by a store
13729 release followed by
13732 order. The s_waitcnt
13733 could be placed after
13734 seq_store or before
13737 make the s_waitcnt be
13738 as late as possible
13744 instructions same as
13747 except must generate
13748 all instructions even
13751 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
13752 - system - generic vmcnt(0) & vscnt(0)
13754 - Could be split into
13756 vmcnt(0), s_waitcnt
13757 vscnt(0) and s_waitcnt
13758 lgkmcnt(0) to allow
13760 independently moved
13763 - s_waitcnt lgkmcnt(0)
13770 ordering of seq_cst
13776 lgkmcnt(0) and so do
13779 - s_waitcnt vmcnt(0)
13782 global/generic load
13784 atomicrmw-with-return-value
13786 ordering of seq_cst
13795 - s_waitcnt vscnt(0)
13798 global/generic store
13800 atomicrmw-no-return-value
13802 ordering of seq_cst
13815 memory instructions
13821 prevents reordering
13824 seq_cst load. (Note
13830 followed by a store
13837 release followed by
13840 order. The s_waitcnt
13841 could be placed after
13842 seq_store or before
13845 make the s_waitcnt be
13846 as late as possible
13852 instructions same as
13855 except must generate
13856 all instructions even
13858 store atomic seq_cst - singlethread - global *Same as corresponding
13859 - wavefront - local store atomic release,
13860 - workgroup - generic except must generate
13861 - agent all instructions even
13862 - system for OpenCL.*
13863 atomicrmw seq_cst - singlethread - global *Same as corresponding
13864 - wavefront - local atomicrmw acq_rel,
13865 - workgroup - generic except must generate
13866 - agent all instructions even
13867 - system for OpenCL.*
13868 fence seq_cst - singlethread *none* *Same as corresponding
13869 - wavefront fence acq_rel,
13870 - workgroup except must generate
13871 - agent all instructions even
13872 - system for OpenCL.*
13873 ============ ============ ============== ========== ================================
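As an illustrative example (a sketch only, derived from the table above; the
register choices are arbitrary), a GFX10-GFX11 workgroup-scope acquire load
from the global address space in WGP wavefront execution mode lowers to a
sequence of the form::

  global_load_dword v2, v[0:1], off glc    ; 1. load with glc=1
  s_waitcnt vmcnt(0)                       ; 2. ensure the load has completed
  buffer_gl0_inv                           ; 3. invalidate the per-CU L0 cache

In CU wavefront execution mode parts of this sequence are omitted, as noted in
the table.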
13875 .. _amdgpu-amdhsa-trap-handler-abi:
Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13881 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13882 supports the ``s_trap`` instruction. For usage see:
13884 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13885 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13886 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13888 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13889 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13891 =================== =============== =============== =======================================
13892 Usage Code Sequence Trap Handler Description
13894 =================== =============== =============== =======================================
13895 reserved ``s_trap 0x00`` Reserved by hardware.
13896 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap``
13897 ``queue_ptr`` intrinsic (not implemented).
13900 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13901 ``queue_ptr`` the trap instruction. The associated
13902 queue is signalled to put it into the
13903 error state. When the queue is put in
13904 the error state, the waves executing
13905 dispatches on the queue will be
13907 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13908 as a no-operation. The trap handler
13909 is entered and immediately returns to
13910 continue execution of the wavefront.
13911 - If the debugger is enabled, causes
13912 the debug trap to be reported by the
13913 debugger and the wavefront is put in
13914 the halt state with the PC at the
13915 instruction. The debugger must
13916 increment the PC and resume the wave.
13917 reserved ``s_trap 0x04`` Reserved.
13918 reserved ``s_trap 0x05`` Reserved.
13919 reserved ``s_trap 0x06`` Reserved.
13920 reserved ``s_trap 0x07`` Reserved.
13921 reserved ``s_trap 0x08`` Reserved.
13922 reserved ``s_trap 0xfe`` Reserved.
13923 reserved ``s_trap 0xff`` Reserved.
13924 =================== =============== =============== =======================================
13928 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13929 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13931 =================== =============== =============== =======================================
13932 Usage Code Sequence Trap Handler Description
13934 =================== =============== =============== =======================================
13935 reserved ``s_trap 0x00`` Reserved by hardware.
13936 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for
13937 breakpoints. Causes wave to be halted
13938 with the PC at the trap instruction.
13939 The debugger is responsible to resume
13940 the wave, including the instruction
13941 that the breakpoint overwrote.
13942 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13943 ``queue_ptr`` the trap instruction. The associated
13944 queue is signalled to put it into the
13945 error state. When the queue is put in
13946 the error state, the waves executing
13947 dispatches on the queue will be
13949 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13950 as a no-operation. The trap handler
13951 is entered and immediately returns to
13952 continue execution of the wavefront.
13953 - If the debugger is enabled, causes
13954 the debug trap to be reported by the
13955 debugger and the wavefront is put in
13956 the halt state with the PC at the
13957 instruction. The debugger must
13958 increment the PC and resume the wave.
13959 reserved ``s_trap 0x04`` Reserved.
13960 reserved ``s_trap 0x05`` Reserved.
13961 reserved ``s_trap 0x06`` Reserved.
13962 reserved ``s_trap 0x07`` Reserved.
13963 reserved ``s_trap 0x08`` Reserved.
13964 reserved ``s_trap 0xfe`` Reserved.
13965 reserved ``s_trap 0xff`` Reserved.
13966 =================== =============== =============== =======================================
13970 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13971 :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13973 =================== =============== ================ ================= =======================================
13974 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13975 =================== =============== ================ ================= =======================================
13976 reserved ``s_trap 0x00`` Reserved by hardware.
13977 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for
13978 breakpoints. Causes wave to be halted
13979 with the PC at the trap instruction.
13980 The debugger is responsible to resume
13981 the wave, including the instruction
13982 that the breakpoint overwrote.
13983 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at
13984 ``queue_ptr`` the trap instruction. The associated
13985 queue is signalled to put it into the
13986 error state. When the queue is put in
13987 the error state, the waves executing
13988 dispatches on the queue will be
13990 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If debugger not enabled then behaves
13991 as a no-operation. The trap handler
13992 is entered and immediately returns to
13993 continue execution of the wavefront.
13994 - If the debugger is enabled, causes
13995 the debug trap to be reported by the
13996 debugger and the wavefront is put in
13997 the halt state with the PC at the
13998 instruction. The debugger must
13999 increment the PC and resume the wave.
14000 reserved ``s_trap 0x04`` Reserved.
14001 reserved ``s_trap 0x05`` Reserved.
14002 reserved ``s_trap 0x06`` Reserved.
14003 reserved ``s_trap 0x07`` Reserved.
14004 reserved ``s_trap 0x08`` Reserved.
14005 reserved ``s_trap 0xfe`` Reserved.
14006 reserved ``s_trap 0xff`` Reserved.
14007 =================== =============== ================ ================= =======================================
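For example (a sketch only), on a code object V4 and above target for
GFX9-GFX11 the two trap intrinsics lower to single trap instructions with the
trap IDs listed above::

  s_trap 2    ; llvm.trap: wave is halted and the queue is put into the error state
  s_trap 3    ; llvm.debugtrap: no-operation unless a debugger is enabled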
14009 .. _amdgpu-amdhsa-function-call-convention:
Function Call Convention
~~~~~~~~~~~~~~~~~~~~~~~~

This section is currently incomplete and has inaccuracies. It is a work in
progress that will be updated as information is determined.
14019 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
14020 addresses. Unswizzled addresses are normal linear addresses.
14022 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.
See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.
14032 The following is not part of the AMDGPU kernel calling convention but describes
14033 how the AMDGPU implements function calls:
1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference* [HSA]_.
14038 - All structs are passed directly.
14039 - Lambda values are passed *TBA*.
14043 - Does this really follow HSA rules? Or are structs >16 bytes passed
14045 - What is ABI for lambda values?
14047 4. The kernel performs certain setup in its prolog, as described in
14048 :ref:`amdgpu-amdhsa-kernel-prolog`.
14050 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
14052 Non-Kernel Functions
14053 ++++++++++++++++++++
14055 This section describes the call convention ABI for functions other than the
14056 outer kernel function.
14058 If a kernel has function calls then scratch is always allocated and used for
14059 the call stack which grows from low address to high address using the swizzled
14060 scratch address space.
14062 On entry to a function:
14064 1. SGPR0-3 contain a V# with the following properties (see
14065 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
* Base address pointing to the beginning of the wavefront scratch backing
  memory.
14069 * Swizzled with dword element size and stride of wavefront size elements.
2. The FLAT_SCRATCH register pair is set up. See
14072 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
14073 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
14074 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
14075 4. The EXEC register is set to the lanes active on entry to the function.
14076 5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 return address (RA). The code address that the function must
   return to when it completes. The value is undefined if the function is *no
   return*.
14082 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
14083 offset relative to the beginning of the wavefront scratch backing memory.
   The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
   offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
   manner (see the sketch after this list).
14089 The unswizzled SP value can be converted into the swizzled SP value by:
14091 | swizzled SP = unswizzled SP / wavefront size
14093 This may be used to obtain the private address space address of stack
14094 objects and to convert this address to a flat address by adding the flat
14095 scratch aperture base address.
14097 The swizzled SP value is always 4 bytes aligned for the ``r600``
14098 architecture and 16 byte aligned for the ``amdgcn`` architecture.
14102 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
14103 OpenCL language which has the largest base type defined as 16 bytes.
14105 On entry, the swizzled SP value is the address of the first function
14106 argument passed on the stack. Other stack passed arguments are positive
14107 offsets from the entry swizzled SP value.
14109 The function may use positive offsets beyond the last stack passed argument
14110 for stack allocated local variables and register spill slots. If necessary,
14111 the function may align these to greater alignment than 16 bytes. After these
14112 the function may dynamically allocate space for such things as runtime sized
14113 ``alloca`` local allocations.
14115 If the function calls another function, it will place any stack allocated
14116 arguments after the last local allocation and adjust SGPR32 to the address
14117 after the last local allocation.
14119 9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.
11. Use pass-by-reference (byref) instead of pass-by-value (byval) for struct
    arguments in the C ABI. The callee is responsible for allocating stack
    memory and copying the value of the struct if modified. Note that the
    backend still supports byval for struct arguments.
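Referring to items 1 and 8 above, a minimal sketch of a swizzled stack access
(the slot offset and registers are illustrative only) uses the scratch V# in
SGPR0-3 with the unswizzled SP in SGPR32 as the SGPR offset::

  buffer_store_dword v0, off, s[0:3], s32 offset:16    ; store to a stack slot
  buffer_load_dword  v1, off, s[0:3], s32 offset:16    ; reload the same slot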
14127 On exit from a function:
14129 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
14130 described below. Any registers used are considered clobbered registers.
14131 2. The following registers are preserved and have the same value as on entry:
14136 * All SGPR registers except the clobbered registers of SGPR4-31.
14154 Except the argument registers, the VGPRs clobbered and the preserved
14155 registers are intermixed at regular intervals in order to keep a
14156 similar ratio independent of the number of allocated VGPRs.
14158 * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
14159 * Lanes of all VGPRs that are inactive at the call site.
   For the AMDGPU backend, an inter-procedural register allocation (IPRA)
   optimization may mark some of the clobbered SGPR and VGPR registers as
   preserved if it can be determined that the called function does not change
   their value.
3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.
14174 - How are function results returned? The address of structured types is passed
14175 by reference, but what about other types?
14177 The function input arguments are made up of the formal arguments explicitly
14178 declared by the source language function plus the implicit input arguments used
14179 by the implementation.
14181 The source language input arguments are:
1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
14185 2. Followed by the function formal arguments in left to right source order.
14187 The source language result arguments are:
14189 1. The function result argument.
14191 The source language input or result struct type arguments that are less than or
14192 equal to 16 bytes, are decomposed recursively into their base type fields, and
14193 each field is passed as if a separate argument. For input arguments, if the
14194 called function requires the struct to be in memory, for example because its
14195 address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
14199 The source language input struct type arguments that are greater than 16 bytes,
14200 are passed by reference. The caller is responsible for allocating a stack
14201 location to make a copy of the struct value and pass the address as the input
14202 argument. The called function is responsible to perform the dereference when
14203 accessing the input argument. Clang terms this *by-value struct*.
14205 A source language result struct type argument that is greater than 16 bytes, is
14206 returned by reference. The caller is responsible for allocating a stack location
14207 to hold the result value and passes the address as the last input argument
14208 (before the implicit input arguments). In this case there are no result
14209 arguments. The called function is responsible to perform the dereference when
14210 storing the result value. Clang terms this *structured return (sret)*.
14212 *TODO: correct the ``sret`` definition.*
14216 Is this definition correct? Or is ``sret`` only used if passing in registers, and
14217 pass as non-decomposed struct as stack argument? Or something else? Is the
14218 memory location in the caller stack frame, or a stack memory argument and so
14219 no address is passed as the caller can directly write to the argument stack
14220 location? But then the stack location is still live after return. If an
14221 argument stack location is it the first stack argument or the last one?
Lambda argument types are treated as struct types with an implementation defined
set of fields.
14228 Need to specify the ABI for lambda types for AMDGPU.
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
14234 The AMDGPU backend walks the function call graph from the leaves to determine
14235 which implicit input arguments are used, propagating to each caller of the
14236 function. The used implicit arguments are appended to the function arguments
14237 after the source language arguments in the following order:
14241 Is recursion or external functions supported?
14243 1. Work-Item ID (1 VGPR)
   The X, Y and Z work-item IDs are packed into a single VGPR with the following
   layout (see the unpacking sketch after this list). Only fields actually used
   by the function are set. The other bits are unused.

   The values come from the initial kernel execution state. See
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14252 .. table:: Work-item implicit argument layout
14253 :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
14255 ======= ======= ==============
14256 Bits Size Field Name
14257 ======= ======= ==============
14258 9:0 10 bits X Work-Item ID
14259 19:10 10 bits Y Work-Item ID
14260 29:20 10 bits Z Work-Item ID
14261 31:30 2 bits Unused
14262 ======= ======= ==============
14264 2. Dispatch Ptr (2 SGPRs)
14266 The value comes from the initial kernel execution state. See
14267 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14269 3. Queue Ptr (2 SGPRs)
14271 The value comes from the initial kernel execution state. See
14272 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14274 4. Kernarg Segment Ptr (2 SGPRs)
14276 The value comes from the initial kernel execution state. See
14277 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14279 5. Dispatch id (2 SGPRs)
14281 The value comes from the initial kernel execution state. See
14282 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14284 6. Work-Group ID X (1 SGPR)
14286 The value comes from the initial kernel execution state. See
14287 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14289 7. Work-Group ID Y (1 SGPR)
14291 The value comes from the initial kernel execution state. See
14292 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14294 8. Work-Group ID Z (1 SGPR)
14296 The value comes from the initial kernel execution state. See
14297 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14299 9. Implicit Argument Ptr (2 SGPRs)
14301 The value is computed by adding an offset to Kernarg Segment Ptr to get the
14302 global address space pointer to the first kernarg implicit argument.
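As referenced in the Work-Item ID item above, the packed work-item IDs can be
unpacked following :ref:`amdgpu-amdhsa-workitem-implicit-argument-layout-table`.
This is a sketch only; the VGPR holding the packed value depends on the
argument assignment::

  v_and_b32 v0, 0x3ff, v31     ; X work-item ID (bits 9:0)
  v_bfe_u32 v1, v31, 10, 10    ; Y work-item ID (bits 19:10)
  v_bfe_u32 v2, v31, 20, 10    ; Z work-item ID (bits 29:20)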
14304 The input and result arguments are assigned in order in the following manner:
14308 There are likely some errors and omissions in the following description that
14313 Check the Clang source code to decipher how function arguments and return
14314 results are handled. Also see the AMDGPU specific values used.
* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.
14325 How are overly aligned structures allocated on the stack?
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.

  If there are more arguments than will fit in these registers, the remaining
  arguments are allocated on the stack in order on naturally aligned
  addresses.
14334 Note that decomposed struct type arguments may have some fields passed in
14335 registers and some in memory.
14339 So, a struct which can pass some fields as decomposed register arguments, will
14340 pass the rest as decomposed stack elements? But an argument that will not start
14341 in registers will not be decomposed and will be passed as a non-decomposed
14344 The following is not part of the AMDGPU function calling convention but
14345 describes how the AMDGPU implements function calls:
14347 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
14348 unswizzled scratch address. It is only needed if runtime sized ``alloca``
14349 are used, or for the reasons defined in ``SIFrameLowering``.
14350 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
14351 to access the incoming stack arguments in the function. The BP is needed
14352 only when the function requires the runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
14356 4. No CFI is currently generated. See
14357 :ref:`amdgpu-dwarf-call-frame-information`.
14361 CFI will be generated that defines the CFA as the unswizzled address
14362 relative to the wave scratch base in the unswizzled private address space
14363 of the lowest address stack allocated local variable.
14365 ``DW_AT_frame_base`` will be defined as the swizzled address in the
14366 swizzled private address space by dividing the CFA by the wavefront size
14367 (since CFA is always at least dword aligned which matches the scratch
14368 swizzle element size).
14370 If no dynamic stack alignment was performed, the stack allocated arguments
14371 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
14372 local variables and register spill slots are accessed as positive offsets
14373 relative to ``DW_AT_frame_base``.
14375 5. Function argument passing is implemented by copying the input physical
14376 registers to virtual registers on entry. The register allocator can spill if
14377 necessary. These are copied back to physical registers at call sites. The
14378 net effect is that each function call can have these values in entirely
14379 distinct locations. The IPRA can help avoid shuffling argument registers.
14380 6. Call sites are implemented by setting up the arguments at positive offsets
14381 from SP. Then SP is incremented to account for the known frame size before
14382 the call and decremented after the call.
14386 The CFI will reflect the changed calculation needed to compute the CFA
14389 7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
14390 emergency spill slot. Buffer instructions are used for stack accesses and
14391 not the ``flat_scratch`` instruction.
14395 Explain when the emergency spill slot is used.
14399 Possible broken issues:
14401 - Stack arguments must be aligned to required alignment.
14402 - Stack is aligned to max(16, max formal argument alignment)
14403 - Direct argument < 64 bits should check register budget.
14404 - Register budget calculation should respect ``inreg`` for SGPR.
14405 - SGPR overflow is not handled.
14406 - struct with 1 member unpeeling is not checking size of member.
14407 - ``sret`` is after ``this`` pointer.
14408 - Caller is not implementing stack realignment: need an extra pointer.
14409 - Should say AMDGPU passes FP rather than SP.
14410 - Should CFI define CFA as address of locals or arguments. Difference is
14411 apparent when have implemented dynamic alignment.
14412 - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
14413 highest address of stack frame and use negative offset for locals. Would
14414 allow SP to be the same as FP and could support signal-handler-like as now
14415 have a real SP for the top of the stack.
- How is ``sret`` passed on the stack? In argument stack area? Can it overlay
  arguments?
AMDPAL
------

This section provides code conventions used when the target triple OS is
14423 ``amdpal`` (see :ref:`amdgpu-target-triples`).
14425 .. _amdgpu-amdpal-code-object-metadata-section:
14427 Code Object Metadata
14428 ~~~~~~~~~~~~~~~~~~~~
14432 The metadata is currently in development and is subject to major
14433 changes. Only the current version is supported. *When this document
14434 was generated the version was 2.6.*
14436 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
14437 record (see :ref:`amdgpu-note-records-v3-onwards`).
14439 The metadata is represented as Message Pack formatted binary data (see
14440 [MsgPack]_). The top level is a Message Pack map that includes the keys
14441 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
14442 and referenced tables.
14444 Additional information can be added to the maps. To avoid conflicts, any
14445 key names should be prefixed by "*vendor-name*." where ``vendor-name``
14446 can be the name of the vendor and specific vendor tool that generates the
14447 information. The prefix is abbreviated to simply "." when it appears
14448 within a map that has been added by the same *vendor-name*.
14450 .. table:: AMDPAL Code Object Metadata Map
14451 :name: amdgpu-amdpal-code-object-metadata-map-table
14453 =================== ============== ========= ======================================================================
14454 String Key Value Type Required? Description
14455 =================== ============== ========= ======================================================================
14456 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values
14457 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
14458 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See
14459 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
14460 definition of the keys included in that map.
14461 =================== ============== ========= ======================================================================
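For example, a minimal top-level metadata map (all values, including the
register offset, are purely illustrative, and the Message Pack data is rendered
here as YAML-like text) might look like::

  amdpal.version: [ 2, 6 ]
  amdpal.pipelines:
    - .internal_pipeline_hash: [ 0x1234567890abcdef, 0xfedcba0987654321 ]
      .hardware_stages:
        .ps:
          .sgpr_count: 10
          .vgpr_count: 8
      .registers:
        0x2c4a: 0x00000000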
14465 .. table:: AMDPAL Code Object Pipeline Metadata Map
14466 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
14468 ====================================== ============== ========= ===================================================
14469 String Key Value Type Required? Description
14470 ====================================== ============== ========= ===================================================
14471 ".name" string Source name of the pipeline.
14472 ".type" string Pipeline type, e.g. VsPs. Values include:
14482 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
14483 2 integers 64 bits is the "stable" portion of the hash, used
14484 for e.g. shader replacement lookup. Upper 64 bits
14485 is the "unique" portion of the hash, used for
14486 e.g. pipeline cache lookup. The value is
14487 implementation defined, and can not be relied on
14488 between different builds of the compiler.
14489 ".shaders" map Per-API shader metadata. See
14490 :ref:`amdgpu-amdpal-code-object-shader-map-table`
14491 for the definition of the keys included in that
14493 ".hardware_stages" map Per-hardware stage metadata. See
14494 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
14495 for the definition of the keys included in that
14497 ".shader_functions" map Per-shader function metadata. See
14498 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
14499 for the definition of the keys included in that
14501 ".registers" map Required Hardware register configuration. See
14502 :ref:`amdgpu-amdpal-code-object-register-map-table`
14503 for the definition of the keys included in that
14505 ".user_data_limit" integer Number of user data entries accessed by this
14507 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
14508 NoUserDataSpilling.
14509 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
14510 viewport array index feature. Pipelines which use
14511 this feature can render into all 16 viewports,
14512 whereas pipelines which do not use it are
14513 restricted to viewport #0.
14514 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
14515 handling data-passing between the ES and GS
14516 shader stages. This can be zero if the data is
14517 passed using off-chip buffers. This value should
14518 be used to program all user-SGPRs which have been
14519 marked with "UserDataMapping::EsGsLdsSize"
14520 (typically only the GS and VS HW stages will ever
14521 have a user-SGPR so marked).
14522 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
14523 (maximum number of threads in a subgroup).
14524 ".num_interpolants" integer Graphics only. Number of PS interpolants.
14525 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
14526 ".api" string Name of the client graphics API.
14527 ".api_create_info" binary Graphics API shader create info binary blob. Can
14528 be defined by the driver using the compiler if
14529 they want to be able to correlate API-specific
14530 information used during creation at a later time.
14531 ====================================== ============== ========= ===================================================
14535 .. table:: AMDPAL Code Object Shader Map
14536 :name: amdgpu-amdpal-code-object-shader-map-table
14539 +-------------+--------------+-------------------------------------------------------------------+
14540 |String Key |Value Type |Description |
14541 +=============+==============+===================================================================+
14542 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
14543 |- ".vertex" | |for the definition of the keys included in that map. |
14546 |- ".geometry"| | |
14548 +-------------+--------------+-------------------------------------------------------------------+
14552 .. table:: AMDPAL Code Object API Shader Metadata Map
14553 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
14555 ==================== ============== ========= =====================================================================
14556 String Key Value Type Required? Description
14557 ==================== ============== ========= =====================================================================
14558 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
14559 2 integers is implementation defined, and can not be relied on between
14560 different builds of the compiler.
14561 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
14572 ==================== ============== ========= =====================================================================
14576 .. table:: AMDPAL Code Object Hardware Stage Map
14577 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14579 +-------------+--------------+-----------------------------------------------------------------------+
14580 |String Key |Value Type |Description |
14581 +=============+==============+=======================================================================+
14582 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14583 |- ".hs" | |for the definition of the keys included in that map. |
14589 +-------------+--------------+-----------------------------------------------------------------------+
14593 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14594 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14596 ========================== ============== ========= ===============================================================
14597 String Key Value Type Required? Description
14598 ========================== ============== ========= ===============================================================
14599 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
14600 ".scratch_memory_size" integer Scratch memory size in bytes.
14601 ".lds_size" integer Local Data Share size in bytes.
14602 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
14603 ".vgpr_count" integer Number of VGPRs used.
14604 ".agpr_count" integer Number of AGPRs used.
14605 ".sgpr_count" integer Number of SGPRs used.
14606 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
14607 directive to instruct the compiler to limit the VGPR usage to
14608 be less than or equal to the specified value (only set if
14609 different from HW default).
14610 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW
14612 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
14614 ".wavefront_size" integer Wavefront size (only set if different from HW default).
14615 ".uses_uavs" boolean The shader reads or writes UAVs.
14616 ".uses_rovs" boolean The shader reads or writes ROVs.
14617 ".writes_uavs" boolean The shader writes to one or more UAVs.
14618 ".writes_depth" boolean The shader writes out a depth value.
14619 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
14621 ".uses_prim_id" boolean The shader uses PrimID.
14622 ========================== ============== ========= ===============================================================
14626 .. table:: AMDPAL Code Object Shader Function Map
14627 :name: amdgpu-amdpal-code-object-shader-function-map-table
14629 =============== ============== ====================================================================
14630 String Key Value Type Description
14631 =============== ============== ====================================================================
14632 *symbol name* map *symbol name* is the ELF symbol name of the shader function code
14633 entry address. The value is the function's metadata. See
14634 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14635 =============== ============== ====================================================================
14639 .. table:: AMDPAL Code Object Shader Function Metadata Map
14640 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14642 ============================= ============== =================================================================
14643 String Key Value Type Description
14644 ============================= ============== =================================================================
14645 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
14646 2 integers is implementation defined, and can not be relied on between
14647 different builds of the compiler.
14648 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
14649 ".lds_size" integer Size in bytes of LDS memory.
14650 ".vgpr_count" integer Number of VGPRs used by the shader.
14651 ".sgpr_count" integer Number of SGPRs used by the shader.
14652 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
14653 ".shader_subtype" string Shader subtype/kind. Values include:
14657 ============================= ============== =================================================================
14661 .. table:: AMDPAL Code Object Register Map
14662 :name: amdgpu-amdpal-code-object-register-map-table
14664 ========================== ============== ====================================================================
14665 32-bit Integer Key Value Type Description
14666 ========================== ============== ====================================================================
14667 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14668 a GRBM register (i.e., driver accessible GPU register number, not
14669 shader GPR register number). The driver is required to program each
14670 specified register to the corresponding specified value when
14671 executing this pipeline. Typically, the ``reg offsets`` are the
14672 ``uint16_t`` offsets to each register as defined by the hardware
14673 chip headers. The register is set to the provided value. However, a
14674 ``reg offset`` that specifies a user data register (e.g.,
14675 COMPUTE_USER_DATA_0) needs special treatment. See
:ref:`amdgpu-amdpal-code-object-user-data-section` section for more information.
14678 ========================== ============== ====================================================================
14680 .. _amdgpu-amdpal-code-object-user-data-section:
14685 Each hardware stage has a set of 32-bit physical SPI *user data registers*
14686 (either 16 or 32 based on graphics IP and the stage) which can be
14687 written from a command buffer and then loaded into SGPRs when waves are
14688 launched via a subsequent dispatch or draw operation. This is the way
most arguments are passed from the application/runtime to a hardware shader stage.
PAL abstracts this functionality by exposing a set of 128 *user data
entries* per pipeline that a client can use to pass arguments from a command
14694 buffer to one or more shaders in that pipeline. The ELF code object must
14695 specify a mapping from virtualized *user data entries* to physical *user
14696 data registers*, and PAL is responsible for implementing that mapping,
14697 including spilling overflow *user data entries* to memory if needed.
14699 Since the *user data registers* are GRBM-accessible SPI registers, this
14700 mapping is actually embedded in the ``.registers`` metadata entry. For
14701 most registers, the value in that map is a literal 32-bit value that
14702 should be written to the register by the driver. However, when the
14703 register is a *user data register* (any USER_DATA register e.g.,
14704 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14705 the driver to write either a *user data entry* value or one of several
14706 driver-internal values to the register. This encoding is described in
14707 the following table:
14711 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14712 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14713 always be programmed to the address of the GlobalTable, and *user data
14714 register* 1 must always be programmed to the address of the PerShaderTable.
14718 .. table:: AMDPAL User Data Mapping
14719 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14721 ========== ================= ===============================================================================
14722 Value Name Description
14723 ========== ================= ===============================================================================
14724 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14725 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
14726 always point to *user data register* 0).
14727 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
14728 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14729 for more detail (should always point to *user data register* 1).
14730 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
:ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for more detail.
14733 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14734 reference the draw index in the vertex shader. Only supported by the first
14735 stage in a graphics pipeline.
14736 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
14737 a graphics pipeline.
0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a graphics pipeline.
14740 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14741 a buffer containing the grid dimensions for a Compute dispatch operation. The
14742 high half of the address is stored in the next sequential user-SGPR. Only
14743 supported by compute pipelines.
14744 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
space used for the ES/GS pseudo-ring-buffer for passing data between shader stages.
14747 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
14748 pipeline instancing.
14749 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
14750 can only appear for one shader stage per pipeline.
14751 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
14752 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
14753 only appear for one shader stage per pipeline.
14754 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
14755 only appear for one shader stage per pipeline (PS). These replace color targets
14756 and are completely separate from any UAVs used by the shader. This is optional,
14757 and only used by the PS when UAV exports are used to replace color-target
14758 exports to optimize specific shaders.
14759 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
14760 some NGG pipelines to perform culling. This value contains the address of the
14761 first of two consecutive registers which provide the full GPU address.
14762 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
14763 ========== ================= ===============================================================================
14765 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14770 Low 32 bits of the GPU address for an optional buffer in the ``.data``
14771 section of the ELF. The high 32 bits of the address match the high 32 bits
14772 of the shader's program counter.
14774 The buffer can be anything the shader compiler needs it for, and
14775 allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
14778 well. Its layout and usage are defined by the shader compiler.
14780 Each shader's table in the ``.data`` section is referenced by the symbol
14781 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
14782 hardware shader stage the data is for. E.g.,
14783 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
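
For illustration, such a per-shader table for the compute shader hardware
stage could be emitted as shown below. This is only a sketch: the alignment
and contents are hypothetical, since the layout and usage are entirely
defined by the shader compiler.

.. code-block:: nasm

  .data
  .p2align 4                          // hypothetical alignment
  _amdgpu_cs_shdr_intrl_data:         // per-shader table for the CS hardware stage
    .long 0, 0, 0, 0                  // e.g. space for one buffer SRD (contents compiler-defined)
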
14785 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14790 It is possible for a hardware shader to need access to more *user data
14791 entries* than there are slots available in user data registers for one
14792 or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory and uses
one user data register to point to the spilled user data memory. The
14795 value of the *user data entry* must then represent the location where
14796 a shader expects to read the low 32-bits of the table's GPU virtual
14797 address. The *spill table* itself represents a set of 32-bit values
14798 managed by the PAL runtime in GPU-accessible memory that can be made
14799 indirectly accessible to a hardware shader.
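
As a sketch of how a shader could consume the spill table, assume the low 32
bits of the table's GPU virtual address have been placed in a user SGPR (here
``s0``) and that the shader supplies the matching high 32 bits (here copied
from another SGPR); the register choices and the byte offset are hypothetical:

.. code-block:: nasm

  s_mov_b32 s1, s7                    // hypothetical: high 32 bits of the spill table address
  s_load_dword s2, s[0:1], 0x10       // read the spilled user data entry at byte offset 16
  s_waitcnt lgkmcnt(0)                // wait for the scalar load to complete
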
14804 This section provides code conventions used when the target triple OS is
14805 empty (see :ref:`amdgpu-target-triples`).
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime does
14811 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
14812 instructions are handled as follows:
14814 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14815 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14817 =============== =============== ===========================================
14818 Usage Code Sequence Description
14819 =============== =============== ===========================================
14820 llvm.trap s_endpgm Causes wavefront to be terminated.
14821 llvm.debugtrap *none* Compiler warning given that there is no
14822 trap handler installed.
14823 =============== =============== ===========================================
When the language is OpenCL, the following differences occur:
14835 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14836 2. The AMDGPU backend appends additional arguments to the kernel's explicit
14837 arguments for the AMDHSA OS (see
14838 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
14839 3. Additional metadata is generated
14840 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14842 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14843 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14845 ======== ==== ========= ===========================================
Position Byte Size Byte Alignment Description
14848 ======== ==== ========= ===========================================
14849 1 8 8 OpenCL Global Offset X
14850 2 8 8 OpenCL Global Offset Y
14851 3 8 8 OpenCL Global Offset Z
14852 4 8 8 OpenCL address of printf buffer
5 8 8 OpenCL address of virtual queue used by enqueue_kernel.
6 8 8 OpenCL address of AqlWrap struct used by enqueue_kernel.
7 8 8 Pointer argument used for Multi-grid synchronization.
14859 ======== ==== ========= ===========================================
When the language is HCC, the following differences occur:
14868 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14870 .. _amdgpu-assembler:
The AMDGPU backend has an LLVM-MC-based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX11.
This section describes the general syntax for instructions and operands.
14883 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14885 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14886 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14888 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14889 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14891 The order of operands and modifiers is fixed.
14892 Most modifiers are optional and may be omitted.
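
For example, the ``offset`` modifier of a DS instruction may be specified or
omitted, in which case it defaults to 0:

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16         // optional modifier specified
  ds_add_u32 v2, v4                   // optional modifier omitted (offset defaults to 0)
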
Links to a detailed instruction syntax description may be found in the
following table. Note that features under development are not included
in this description.
14898 ============= ============================================= =======================================
14899 Architecture Core ISA ISA Variants and Extensions
14900 ============= ============================================= =======================================
14901 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \-
14902 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \-
14903 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14905 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14907 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14909 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14911 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14913 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14915 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14917 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14919 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14921 :doc:`gfx941<AMDGPU/AMDGPUAsmGFX940>`
14923 :doc:`gfx942<AMDGPU/AMDGPUAsmGFX940>`
14925 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14927 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14929 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14931 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14933 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14935 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14937 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14939 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14941 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14943 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14945 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14947 RDNA 3 :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>` :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
14949 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
14951 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
14953 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
14954 ============= ============================================= =======================================
14956 For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture manuals
14958 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14959 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14960 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_,
14961 [AMD-GCN-GFX940-GFX942-CDNA3]_, [AMD-GCN-GFX10-RDNA1]_, [AMD-GCN-GFX10-RDNA2]_
14962 and [AMD-GCN-GFX11-RDNA3]_.
A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
A detailed description of modifiers may be found
14973 :doc:`here<AMDGPUModifierSyntax>`.
14975 Instruction Examples
14976 ~~~~~~~~~~~~~~~~~~~~
14981 .. code-block:: nasm
14983 ds_add_u32 v2, v4 offset:16
14984 ds_write_src2_b64 v2 offset0:4 offset1:8
14985 ds_cmpst_f32 v2, v4, v6
14986 ds_min_rtn_f64 v[8:9], v2, v[4:5]
For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.
14994 .. code-block:: nasm
14996 flat_load_dword v1, v[3:4]
14997 flat_store_dwordx3 v[3:4], v[5:7]
14998 flat_atomic_swap v1, v[3:4], v5 glc
14999 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
15000 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
For a full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.
15008 .. code-block:: nasm
15010 buffer_load_dword v1, off, s[4:7], s1
15011 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
15012 buffer_store_format_xy v[1:2], off, s[4:7], s1
15014 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.
15022 .. code-block:: nasm
15024 s_load_dword s1, s[2:3], 0xfc
15025 s_load_dwordx8 s[8:15], s[2:3], s4
15026 s_load_dwordx16 s[88:103], s[2:3], s4
For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.
15036 .. code-block:: nasm
15039 s_mov_b64 s[0:1], 0x80000000
15041 s_wqm_b64 s[2:3], s[4:5]
15042 s_bcnt0_i32_b64 s1, s[2:3]
15043 s_swappc_b64 s[2:3], s[4:5]
15044 s_cbranch_join s[4:5]
For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.
15052 .. code-block:: nasm
15054 s_add_u32 s1, s2, s3
15055 s_and_b64 s[2:3], s[4:5], s[6:7]
15056 s_cselect_b32 s1, s2, s3
15057 s_andn2_b32 s2, s4, s6
15058 s_lshr_b64 s[2:3], s[4:5], s6
15059 s_ashr_i32 s2, s4, s6
15060 s_bfm_b64 s[2:3], s4, s6
15061 s_bfe_i64 s[2:3], s[4:5], s6
15062 s_cbranch_g_fork s[4:5], s[6:7]
For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.
15070 .. code-block:: nasm
15072 s_cmp_eq_i32 s1, s2
15073 s_bitcmp1_b32 s1, s2
15074 s_bitcmp0_b64 s[2:3], s4
For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.
15083 .. code-block:: nasm
15088 s_waitcnt 0 ; Wait for all counters to be 0
15089 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
15090 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
15094 s_sendmsg sendmsg(MSG_INTERRUPT)
For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.
Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use the optimal encoding based on its operands.
To force a specific encoding, one can add a suffix to the opcode of the instruction:
15111 * _e32 for 32-bit VOP1/VOP2/VOPC
15112 * _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _e64_dpp for VOP3 with DPP
15115 * _sdwa for VOP_SDWA
15117 VOP1/VOP2/VOP3/VOPC examples:
15119 .. code-block:: nasm
15122 v_mov_b32_e32 v1, v2
15124 v_cvt_f64_i32_e32 v[1:2], v2
15125 v_floor_f32_e32 v1, v2
15126 v_bfrev_b32_e32 v1, v2
15127 v_add_f32_e32 v1, v2, v3
15128 v_mul_i32_i24_e64 v1, v2, 3
15129 v_mul_i32_i24_e32 v1, -3, v3
15130 v_mul_i32_i24_e32 v1, -100, v3
15131 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
15132 v_max_f16_e32 v1, v2, v3
15136 .. code-block:: nasm
15138 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
15139 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15140 v_mov_b32 v0, v0 wave_shl:1
15141 v_mov_b32 v0, v0 row_mirror
15142 v_mov_b32 v0, v0 row_bcast:31
15143 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
15144 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15145 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15148 VOP3_DPP examples (Available on GFX11+):
15150 .. code-block:: nasm
15152 v_add_f32_e64_dpp v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
15153 v_sqrt_f32_e64_dpp v0, v1 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15154 v_ldexp_f32 v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
15158 .. code-block:: nasm
15160 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
15161 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
15162 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
15163 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
15164 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
For a full list of supported instructions, refer to "Vector ALU instructions".
15168 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
15170 Code Object V2 Predefined Symbols
15171 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15174 Code object V2 generation is no longer supported by this version of LLVM.
15176 The AMDGPU assembler defines and updates some symbols automatically. These
15177 symbols do not affect code generation.
15179 .option.machine_version_major
15180 +++++++++++++++++++++++++++++
15182 Set to the GFX major generation number of the target being assembled for. For
15183 example, when assembling for a "GFX9" target this will be set to the integer
15184 value "9". The possible GFX major generation numbers are presented in
15185 :ref:`amdgpu-processors`.
15187 .option.machine_version_minor
15188 +++++++++++++++++++++++++++++
15190 Set to the GFX minor generation number of the target being assembled for. For
15191 example, when assembling for a "GFX810" target this will be set to the integer
15192 value "1". The possible GFX minor generation numbers are presented in
15193 :ref:`amdgpu-processors`.
15195 .option.machine_version_stepping
15196 ++++++++++++++++++++++++++++++++
15198 Set to the GFX stepping generation number of the target being assembled for.
15199 For example, when assembling for a "GFX704" target this will be set to the
15200 integer value "4". The possible GFX stepping generation numbers are presented
15201 in :ref:`amdgpu-processors`.
15206 Set to zero each time a
15207 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15208 encountered. At each instruction, if the current value of this symbol is less
15209 than or equal to the maximum VGPR number explicitly referenced within that
instruction, then the symbol value is updated to equal that VGPR number plus one.
15216 Set to zero each time a
15217 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15218 encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction, then the symbol value is updated to equal that SGPR number plus one.
15223 .. _amdgpu-amdhsa-assembler-directives-v2:
15225 Code Object V2 Directives
15226 ~~~~~~~~~~~~~~~~~~~~~~~~~
15229 Code object V2 generation is no longer supported by this version of LLVM.
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
15234 .hsa_code_object_version major, minor
15235 +++++++++++++++++++++++++++++++++++++
15237 *major* and *minor* are integers that specify the version of the HSA code
15238 object that will be generated by the assembler.
15240 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
15241 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15244 *major*, *minor*, and *stepping* are all integers that describe the instruction
15245 set architecture (ISA) version of the assembly program.
15247 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
15248 "AMD" and *arch* should always be equal to "AMDGPU".
15250 By default, the assembler will derive the ISA version, *vendor*, and *arch*
15251 from the value of the -mcpu option that is passed to the assembler.
15253 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
15255 .amdgpu_hsa_kernel (name)
15256 +++++++++++++++++++++++++
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
This directive marks the beginning of a list of key/value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
amd_kernel_code_t values that are unspecified, a default value will be used. The
15269 default value for all keys is 0, with the following exceptions:
15271 - *amd_code_version_major* defaults to 1.
15272 - *amd_kernel_code_version_minor* defaults to 2.
15273 - *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
15275 *amd_machine_version_stepping* are derived from the value of the -mcpu option
15276 that is passed to the assembler.
15277 - *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards,
  it defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
15280 Note that wavefront size is specified as a power of two, so a value of **n**
15281 means a size of 2^ **n**.
15282 - *call_convention* defaults to -1.
15283 - *kernarg_segment_alignment*, *group_segment_alignment*, and
15284 *private_segment_alignment* default to 4. Note that alignments are specified
15285 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
  GFX90A onwards.
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
  GFX10 onwards.
15290 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
15292 The *.amd_kernel_code_t* directive must be placed immediately after the
15293 function label and before any instructions.
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
15296 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
15298 .. _amdgpu-amdhsa-assembler-example-v2:
15300 Code Object V2 Example Source Code
15301 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15304 Code object V2 generation is no longer supported by this version of LLVM.
15306 Here is an example of a minimal assembly source file, defining one HSA kernel:
15311 .hsa_code_object_version 1,0
15312 .hsa_code_object_isa
15317 .amdgpu_hsa_kernel hello_world
15322 enable_sgpr_kernarg_segment_ptr = 1
15324 compute_pgm_rsrc1_vgprs = 0
15325 compute_pgm_rsrc1_sgprs = 0
15326 compute_pgm_rsrc2_user_sgpr = 2
15327 compute_pgm_rsrc1_wgp_mode = 0
15328 compute_pgm_rsrc1_mem_ordered = 0
15329 compute_pgm_rsrc1_fwd_progress = 1
15330 .end_amd_kernel_code_t
15332 s_load_dwordx2 s[0:1], s[0:1] 0x0
15333 v_mov_b32 v0, 3.14159
15334 s_waitcnt lgkmcnt(0)
15337 flat_store_dword v[1:2], v0
15340 .size hello_world, .Lfunc_end0-hello_world
15342 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
15344 Code Object V3 and Above Predefined Symbols
15345 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15347 The AMDGPU assembler defines and updates some symbols automatically. These
15348 symbols do not affect code generation.
15350 .amdgcn.gfx_generation_number
15351 +++++++++++++++++++++++++++++
15353 Set to the GFX major generation number of the target being assembled for. For
15354 example, when assembling for a "GFX9" target this will be set to the integer
15355 value "9". The possible GFX major generation numbers are presented in
15356 :ref:`amdgpu-processors`.
15358 .amdgcn.gfx_generation_minor
15359 ++++++++++++++++++++++++++++
15361 Set to the GFX minor generation number of the target being assembled for. For
15362 example, when assembling for a "GFX810" target this will be set to the integer
15363 value "1". The possible GFX minor generation numbers are presented in
15364 :ref:`amdgpu-processors`.
15366 .amdgcn.gfx_generation_stepping
15367 +++++++++++++++++++++++++++++++
15369 Set to the GFX stepping generation number of the target being assembled for.
15370 For example, when assembling for a "GFX704" target this will be set to the
15371 integer value "4". The possible GFX stepping generation numbers are presented
15372 in :ref:`amdgpu-processors`.
15374 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
15376 .amdgcn.next_free_vgpr
15377 ++++++++++++++++++++++
15379 Set to zero before assembly begins. At each instruction, if the current value
15380 of this symbol is less than or equal to the maximum VGPR number explicitly
15381 referenced within that instruction then the symbol value is updated to equal
15382 that VGPR number plus one.
15384 May be used to set the `.amdhsa_next_free_vgpr` directive in
15385 :ref:`amdhsa-kernel-directives-table`.
15387 May be set at any time, e.g. manually set to zero at the start of each kernel.
15389 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
15391 .amdgcn.next_free_sgpr
15392 ++++++++++++++++++++++
15394 Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
15396 referenced within that instruction then the symbol value is updated to equal
15397 that SGPR number plus one.
May be used to set the `.amdhsa_next_free_sgpr` directive in
15400 :ref:`amdhsa-kernel-directives-table`.
15402 May be set at any time, e.g. manually set to zero at the start of each kernel.
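
For example, both tracking symbols may be reset manually before assembling the
next kernel (see also the multi-kernel example in
:ref:`amdgpu-amdhsa-assembler-example-v3-onwards`):

.. code-block:: nasm

  .set .amdgcn.next_free_vgpr, 0      // restart VGPR usage tracking
  .set .amdgcn.next_free_sgpr, 0      // restart SGPR usage tracking
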
15404 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
15406 Code Object V3 and Above Directives
15407 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15409 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
15410 architecture processors, and are not OS-specific. Directives which begin with
15411 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
15412 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
15413 :ref:`amdgpu-processors`.
15415 .. _amdgpu-assembler-directive-amdgcn-target:
15417 .amdgcn_target <target-triple> "-" <target-id>
15418 ++++++++++++++++++++++++++++++++++++++++++++++
15420 Optional directive which declares the ``<target-triple>-<target-id>`` supported
15421 by the containing assembler source file. Used by the assembler to validate
15422 command-line options such as ``-triple``, ``-mcpu``, and
15423 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
15424 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
15428 The target ID syntax used for code object V2 to V3 for this directive differs
15429 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
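
For example, a source file targeting a GFX9 processor with XNACK enabled might
declare (the target ID shown is illustrative):

.. code-block:: nasm

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900:xnack+"   // illustrative target ID
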
15431 .. _amdgpu-assembler-directive-amdhsa-code-object-version:
15433 .amdhsa_code_object_version <version>
15434 +++++++++++++++++++++++++++++++++++++
15436 Optional directive which declares the code object version to be generated by the
15437 assembler. If not present, a default value will be used.
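
For example, to request a particular version explicitly (the value shown is
illustrative):

.. code-block:: nasm

  .amdhsa_code_object_version 5       // request code object V5
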
15439 .amdhsa_kernel <name>
15440 +++++++++++++++++++++
15442 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
15443 ``<name>.kd``, in the current location of the current section. Only valid when
15444 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
15445 instruction to execute, and does not need to be previously defined.
15447 Marks the beginning of a list of directives used to generate the bytes of a
15448 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
15449 Directives which may appear in this list are described in
15450 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
15451 be valid for the target being assembled for, and cannot be repeated. Directives
15452 support the range of values specified by the field they reference in
15453 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
15454 assumed to have its default value, unless it is marked as "Required", in which
15455 case it is an error to omit the directive. This list of directives is
15456 terminated by an ``.end_amdhsa_kernel`` directive.
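
A minimal sketch of such a list is shown below; the kernel name and register
counts are hypothetical, and only the directives that are required on all
targets are included (some targets require more, e.g. ``.amdhsa_accum_offset``
on GFX90A):

.. code-block:: nasm

  .amdhsa_kernel minimal_kernel       // hypothetical kernel symbol
    .amdhsa_next_free_vgpr 8          // required
    .amdhsa_next_free_sgpr 16         // required
  .end_amdhsa_kernel
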
15458 .. table:: AMDHSA Kernel Assembler Directives
15459 :name: amdhsa-kernel-directives-table
15461 ======================================================== =================== ============ ===================
15462 Directive Default Supported On Description
15463 ======================================================== =================== ============ ===================
15464 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX12 Controls GROUP_SEGMENT_FIXED_SIZE in
15465 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15466 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX12 Controls PRIVATE_SEGMENT_FIXED_SIZE in
15467 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15468 ``.amdhsa_kernarg_size`` 0 GFX6-GFX12 Controls KERNARG_SIZE in
15469 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15470 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX12 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
15471 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`
15472 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
(except GFX940) :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15475 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_DISPATCH_PTR in
15476 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15477 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_QUEUE_PTR in
15478 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15479 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
15480 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15481 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX12 Controls ENABLE_SGPR_DISPATCH_ID in
15482 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15483 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
(except GFX940) :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15486 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX12 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
15487 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15488 ``.amdhsa_wavefront_size32`` Target GFX10-GFX12 Controls ENABLE_WAVEFRONT_SIZE32 in
15489 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15492 ``.amdhsa_uses_dynamic_stack`` 0 GFX6-GFX12 Controls USES_DYNAMIC_STACK in
15493 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15494 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
(except GFX940) :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15497 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in
15498 GFX11-GFX12 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15499 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_X in
15500 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15501 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
15502 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15503 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
15504 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15505 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_INFO in
15506 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15507 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX12 Controls ENABLE_VGPR_WORKITEM_ID in
15508 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15509 Possible values are defined in
15510 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
15511 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX12 Maximum VGPR number explicitly referenced, plus one.
15512 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
15513 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15514 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX12 Maximum SGPR number explicitly referenced, plus one.
15515 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15516 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
``.amdhsa_accum_offset`` Required GFX90A, Offset of the first AccVGPR in the unified register file.
15518 GFX940 Used to calculate ACCUM_OFFSET in
15519 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15520 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX12 Whether the kernel may use the special VCC SGPR.
15521 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15522 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15523 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
15524 (except scratch memory. Used to calculate
15525 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in
15526 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15527 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
15528 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15529 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15531 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX12 Controls FLOAT_ROUND_MODE_32 in
15532 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15533 Possible values are defined in
15534 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15535 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX12 Controls FLOAT_ROUND_MODE_16_64 in
15536 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15537 Possible values are defined in
15538 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15539 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX12 Controls FLOAT_DENORM_MODE_32 in
15540 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15541 Possible values are defined in
15542 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15543 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX12 Controls FLOAT_DENORM_MODE_16_64 in
15544 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15545 Possible values are defined in
15546 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15547 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in
15548 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15549 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in
15550 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15551 ``.amdhsa_round_robin_scheduling`` 0 GFX12 Controls ENABLE_WG_RR_EN in
15552 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15553 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX12 Controls FP16_OVFL in
15554 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15555 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in
15556 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15557 Specific GFX11-GFX12
15559 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX12 Controls ENABLE_WGP_MODE in
15560 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15563 ``.amdhsa_memory_ordered`` 1 GFX10-GFX12 Controls MEM_ORDERED in
15564 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15565 ``.amdhsa_forward_progress`` 0 GFX10-GFX12 Controls FWD_PROGRESS in
15566 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15567 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in
15568 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
15569 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
15570 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15571 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
15572 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15573 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
15574 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15575 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
15576 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15577 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
15578 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15579 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
15580 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15581 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
15582 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15583 ``.amdhsa_user_sgpr_kernarg_preload_length`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_LENGTH in
15584 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15585 ``.amdhsa_user_sgpr_kernarg_preload_offset`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_OFFSET in
15586 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15587 ======================================================== =================== ============ ===================
.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15593 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15595 The contents must be in the [YAML]_ markup format, with the same structure and
15596 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15597 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15598 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15600 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
15602 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15604 Code Object V3 and Above Example Source Code
15605 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15607 Here is an example of a minimal assembly source file, defining one HSA kernel:
15612 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15617 .type hello_world,@function
15619 s_load_dwordx2 s[0:1], s[0:1] 0x0
15620 v_mov_b32 v0, 3.14159
15621 s_waitcnt lgkmcnt(0)
15624 flat_store_dword v[1:2], v0
15627 .size hello_world, .Lfunc_end0-hello_world
15631 .amdhsa_kernel hello_world
15632 .amdhsa_user_sgpr_kernarg_segment_ptr 1
15633 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15634 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15643 - .name: hello_world
15644 .symbol: hello_world.kd
15645 .kernarg_segment_size: 48
15646 .group_segment_fixed_size: 0
15647 .private_segment_fixed_size: 0
15648 .kernarg_segment_align: 4
15649 .wavefront_size: 64
15652 .max_flat_workgroup_size: 256
15656 .value_kind: global_buffer
15657 .address_space: global
15658 .actual_access: write_only
15660 .end_amdgpu_metadata
15662 This kernel is equivalent to the following HIP program:
15667 __global__ void hello_world(float *p) {
15671 If an assembly source file contains multiple kernels and/or functions, the
15672 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15673 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15674 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
15675 kernels, where ``function1`` is only called from ``kernel1`` it is sufficient
15676 to group the function with the kernel that calls it and reset the symbols
15677 between the two connected components:
15682 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15684 // gpr tracking symbols are implicitly set to zero
15689 .type kern0,@function
15694 .size kern0, .Lkern0_end-kern0
15698 .amdhsa_kernel kern0
15700 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15701 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15704 // reset symbols to begin tracking usage in func1 and kern1
15705 .set .amdgcn.next_free_vgpr, 0
15706 .set .amdgcn.next_free_sgpr, 0
15712 .type func1,@function
15715 s_setpc_b64 s[30:31]
15717 .size func1, .Lfunc1_end-func1
15721 .type kern1,@function
15725 s_add_u32 s4, s4, func1@rel32@lo+4
s_addc_u32 s5, s5, func1@rel32@hi+12
15727 s_swappc_b64 s[30:31], s[4:5]
15731 .size kern1, .Lkern1_end-kern1
15735 .amdhsa_kernel kern1
15737 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15738 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15741 These symbols cannot identify connected components in order to automatically
15742 track the usage for each kernel. However, in some cases careful organization of
15743 the kernels and functions in the source file means there is minimal additional
15744 effort required to accurately calculate GPR usage.
15746 Additional Documentation
15747 ========================
15749 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15751 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15752 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15753 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15754 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15755 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15756 .. [AMD-GCN-GFX940-GFX942-CDNA3] `AMD Instinct MI300 Instruction Set Architecture <https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf>`__
15757 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15758 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15759 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
15760 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15761 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15762 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15763 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15764 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
15765 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
15766 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15767 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15768 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15769 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15770 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15771 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15772 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15773 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15774 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15775 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__