1 =============================
2 User Guide for AMDGPU Backend
3 =============================
14 AMDGPU/AMDGPUAsmGFX900
15 AMDGPU/AMDGPUAsmGFX904
16 AMDGPU/AMDGPUAsmGFX906
17 AMDGPU/AMDGPUAsmGFX908
18 AMDGPU/AMDGPUAsmGFX90a
19 AMDGPU/AMDGPUAsmGFX940
21 AMDGPU/AMDGPUAsmGFX1011
22 AMDGPU/AMDGPUAsmGFX1013
23 AMDGPU/AMDGPUAsmGFX1030
27 AMDGPUInstructionSyntax
28 AMDGPUInstructionNotation
29 AMDGPUDwarfExtensionsForHeterogeneousDebugging
30 AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
The AMDGPU backend provides ISA code generation for AMD GPUs, starting with
the R600 family and continuing through the current GCN families. It lives in
the ``llvm/lib/Target/AMDGPU`` directory.
42 .. _amdgpu-target-triples:
47 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
48 to specify the target triple:
50 .. table:: AMDGPU Architectures
51 :name: amdgpu-architecture-table
53 ============ ==============================================================
54 Architecture Description
55 ============ ==============================================================
56 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
57 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
58 ============ ==============================================================
60 .. table:: AMDGPU Vendors
61 :name: amdgpu-vendor-table
63 ============ ==============================================================
65 ============ ==============================================================
66 ``amd`` Can be used for all AMD GPU usage.
67 ``mesa3d`` Can be used if the OS is ``mesa3d``.
68 ============ ==============================================================
70 .. table:: AMDGPU Operating Systems
73 ============== ============================================================
75 ============== ============================================================
76 *<empty>* Defaults to the *unknown* OS.
77 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
80 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
81 loader on Linux. See *AMD ROCm Platform Release Notes*
82 [AMD-ROCm-Release-Notes]_ for supported hardware and
84 - AMD's PAL runtime using the *pal-amdhsa* loader on
87 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
88 runtime using the *pal-amdpal* loader on Windows and Linux
90 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
91 3D runtime using the *mesa-mesa3d* loader on Linux.
92 ============== ============================================================
94 .. table:: AMDGPU Environments
95 :name: amdgpu-environment-table
97 ============ ==============================================================
98 Environment Description
99 ============ ==============================================================
101 ============ ==============================================================
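
For example, the following minimal LLVM IR module targets the HSA runtime on
the ``amdgcn`` architecture; the kernel name and empty body are purely
illustrative:

.. code-block:: llvm

  ; Target the amdgcn architecture, amd vendor, and amdhsa OS.
  target triple = "amdgcn-amd-amdhsa"

  ; Trivial kernel; the name is arbitrary and only illustrates the triple in use.
  define amdgpu_kernel void @example_kernel() {
  entry:
    ret void
  }
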
103 .. _amdgpu-processors:
108 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
109 specify the AMDGPU processor together with optional target features. See
110 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
111 specific information.
113 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
* ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
118 .. table:: AMDGPU Processors
119 :name: amdgpu-processor-table
121 =========== =============== ============ ===== ================= =============== =============== ======================
122 Processor Alternative Target dGPU/ Target Target OS Support Example
123 Processor Triple APU Features Properties *(see* Products
124 Architecture Supported `amdgpu-os`_
133 =========== =============== ============ ===== ================= =============== =============== ======================
134 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
135 -----------------------------------------------------------------------------------------------------------------------
136 ``r600`` ``r600`` dGPU - Does not
141 ``r630`` ``r600`` dGPU - Does not
146 ``rs880`` ``r600`` dGPU - Does not
151 ``rv670`` ``r600`` dGPU - Does not
156 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
157 -----------------------------------------------------------------------------------------------------------------------
158 ``rv710`` ``r600`` dGPU - Does not
163 ``rv730`` ``r600`` dGPU - Does not
168 ``rv770`` ``r600`` dGPU - Does not
173 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
174 -----------------------------------------------------------------------------------------------------------------------
175 ``cedar`` ``r600`` dGPU - Does not
180 ``cypress`` ``r600`` dGPU - Does not
185 ``juniper`` ``r600`` dGPU - Does not
190 ``redwood`` ``r600`` dGPU - Does not
195 ``sumo`` ``r600`` dGPU - Does not
200 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
201 -----------------------------------------------------------------------------------------------------------------------
202 ``barts`` ``r600`` dGPU - Does not
207 ``caicos`` ``r600`` dGPU - Does not
212 ``cayman`` ``r600`` dGPU - Does not
217 ``turks`` ``r600`` dGPU - Does not
222 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
223 -----------------------------------------------------------------------------------------------------------------------
224 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
229 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
234 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
239 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
240 -----------------------------------------------------------------------------------------------------------------------
241 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
242 flat - *pal-amdhsa* - A6 Pro-7050B
243 scratch - *pal-amdpal* - A8-7100
251 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
252 flat - *pal-amdhsa* - FirePro W9100
253 scratch - *pal-amdpal* - FirePro S9150
255 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
256 flat - *pal-amdhsa* - Radeon R9 290x
257 scratch - *pal-amdpal* - Radeon R390
259 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
260 - ``mullins`` flat - *pal-amdpal* - E1-2200
268 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
269 flat - *pal-amdpal* - Radeon HD 8770
272 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
279 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
280 -----------------------------------------------------------------------------------------------------------------------
281 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
282 flat - *pal-amdhsa* - Pro A6-8500B
283 scratch - *pal-amdpal* - A8-8600P
299 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
300 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
301 scratch - *pal-amdpal* - Radeon R9 385
302 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
303 - *pal-amdhsa* - Radeon R9 Fury
304 - *pal-amdpal* - Radeon R9 FuryX
307 - Radeon Instinct MI8
308 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
309 flat - *pal-amdhsa* - Radeon RX 480
310 scratch - *pal-amdpal* - Radeon Instinct MI6
311 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
313 scratch - *pal-amdpal*
314 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
315 flat - *pal-amdhsa* - FirePro S7100
316 scratch - *pal-amdpal* - FirePro W7100
319 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
321 scratch - *pal-amdpal* .. TODO::
326 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
327 -----------------------------------------------------------------------------------------------------------------------
328 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
329 flat - *pal-amdhsa* Frontier Edition
330 scratch - *pal-amdpal* - Radeon RX Vega 56
334 - Radeon Instinct MI25
335 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
336 flat - *pal-amdhsa* - Ryzen 5 2400G
337 scratch - *pal-amdpal*
338 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
340 - *pal-amdpal* .. TODO::
345 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
346 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
347 scratch - *pal-amdpal* - Radeon VII
349 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
353 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
360 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
362 - xnack scratch .. TODO::
363 - kernarg preload - Packed
364 work-item Add product
367 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
368 flat - Ryzen 7 4700GE
369 scratch - Ryzen 5 4600G
381 ``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
383 - xnack scratch .. TODO::
384 - kernarg preload - Packed
385 work-item Add product
388 ``gfx941`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
390 - xnack scratch .. TODO::
391 - kernarg preload - Packed
392 work-item Add product
395 ``gfx942`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
397 - xnack scratch .. TODO::
398 - kernarg preload - Packed
399 work-item Add product
402 **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
403 -----------------------------------------------------------------------------------------------------------------------
404 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
405 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
406 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
408 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
409 - wavefrontsize64 - Absolute - *pal-amdhsa*
410 - xnack flat - *pal-amdpal*
412 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
413 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
414 - xnack scratch - *pal-amdpal*
415 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
416 - wavefrontsize64 flat - *pal-amdhsa*
417 - xnack scratch - *pal-amdpal* .. TODO::
422 **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
423 -----------------------------------------------------------------------------------------------------------------------
424 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
425 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
426 scratch - *pal-amdpal* - Radeon RX 6900 XT
427 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
428 - wavefrontsize64 flat - *pal-amdhsa*
429 scratch - *pal-amdpal*
430 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
431 - wavefrontsize64 flat - *pal-amdhsa*
432 scratch - *pal-amdpal* .. TODO::
437 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
438 - wavefrontsize64 flat
443 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
444 - wavefrontsize64 flat
450 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
451 - wavefrontsize64 flat
456 ``gfx1036`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
457 - wavefrontsize64 flat
463 **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
464 -----------------------------------------------------------------------------------------------------------------------
465 ``gfx1100`` ``amdgcn`` dGPU - cumode - Architected - *pal-amdpal* *TBA*
466 - wavefrontsize64 flat
469 work-item Add product
472 ``gfx1101`` ``amdgcn`` dGPU - cumode - Architected *TBA*
473 - wavefrontsize64 flat
476 work-item Add product
479 ``gfx1102`` ``amdgcn`` dGPU - cumode - Architected *TBA*
480 - wavefrontsize64 flat
483 work-item Add product
486 ``gfx1103`` ``amdgcn`` APU - cumode - Architected *TBA*
487 - wavefrontsize64 flat
490 work-item Add product
493 ``gfx1150`` ``amdgcn`` APU - cumode - Architected *TBA*
494 - wavefrontsize64 flat
497 work-item Add product
500 ``gfx1151`` ``amdgcn`` APU - cumode - Architected *TBA*
501 - wavefrontsize64 flat
504 work-item Add product
507 ``gfx1200`` ``amdgcn`` dGPU - cumode - Architected *TBA*
508 - wavefrontsize64 flat
511 work-item Add product
514 ``gfx1201`` ``amdgcn`` dGPU - cumode - Architected *TBA*
515 - wavefrontsize64 flat
518 work-item Add product
521 =========== =============== ============ ===== ================= =============== =============== ======================
523 .. _amdgpu-target-features:
528 Target features control how code is generated to support certain
529 processor specific features. Not all target features are supported by
530 all processors. The runtime must ensure that the features supported by
531 the device used to execute the code match the features enabled when
532 generating the code. A mismatch of features may result in incorrect
533 execution, or a reduction in performance.
The target features supported by each processor are listed in
536 :ref:`amdgpu-processor-table`.
538 Target features are controlled by exactly one of the following Clang
541 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
The ``-mcpu`` and ``--offload-arch`` options can specify target features as
optional components of the target ID. If omitted, a target feature has the
``any`` value. See :ref:`amdgpu-target-id`.
547 ``-m[no-]<target-feature>``
549 Target features not specified by the target ID are specified using a
550 separate option. These target features can have an ``on`` or ``off``
551 value. ``on`` is specified by omitting the ``no-`` prefix, and
552 ``off`` is specified by including the ``no-`` prefix. The default
553 if not specified is ``off``.
557 ``-mcpu=gfx908:xnack+``
558 Enable the ``xnack`` feature.
559 ``-mcpu=gfx908:xnack-``
560 Disable the ``xnack`` feature.
562 Enable the ``cumode`` feature.
564 Disable the ``cumode`` feature.
566 .. table:: AMDGPU Target Features
567 :name: amdgpu-target-features-table
569 =============== ============================ ==================================================
570 Target Feature Clang Option to Control Description
572 =============== ============================ ==================================================
573 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
574 when generating code for kernels. When disabled
575 native WGP wavefront execution mode is used,
576 when enabled CU wavefront execution mode is used
577 (see :ref:`amdgpu-amdhsa-memory-model`).
579 sramecc - ``-mcpu`` If specified, generate code that can only be
580 - ``--offload-arch`` loaded and executed in a process that has a
581 matching setting for SRAMECC.
583 If not specified for code object V2 to V3, generate
584 code that can be loaded and executed in a process
585 with SRAMECC enabled.
587 If not specified for code object V4 or above, generate
588 code that can be loaded and executed in a process
589 with either setting of SRAMECC.
591 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
592 work-groups are launched in threadgroup split mode.
593 When enabled the waves of a work-group may be
594 launched in different CUs.
596 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
597 generating code for kernels. When disabled
598 native wavefront size 32 is used, when enabled
599 wavefront size 64 is used.
601 xnack - ``-mcpu`` If specified, generate code that can only be
602 - ``--offload-arch`` loaded and executed in a process that has a
603 matching setting for XNACK replay.
605 If not specified for code object V2 to V3, generate
606 code that can be loaded and executed in a process
607 with XNACK replay enabled.
609 If not specified for code object V4 or above, generate
610 code that can be loaded and executed in a process
611 with either setting of XNACK replay.
613 XNACK replay can be used for demand paging and
614 page migration. If enabled in the device, then if
615 a page fault occurs the code may execute
616 incorrectly unless generated with XNACK replay
617 enabled, or generated for code object V4 or above without
618 specifying XNACK replay. Executing code that was
619 generated with XNACK replay enabled, or generated
620 for code object V4 or above without specifying XNACK replay,
621 on a device that does not have XNACK replay
622 enabled will execute correctly but may be less
623 performant than code generated for XNACK replay
625 =============== ============================ ==================================================
627 .. _amdgpu-target-id:
632 AMDGPU supports target IDs. See `Clang Offload Bundler
633 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
634 description. The AMDGPU target specific information is:
637 Is an AMDGPU processor or alternative processor name specified in
638 :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
639 the primary processor and alternative processor names. The canonical form
target ID only allows the primary processor name.
643 Is a target feature name specified in :ref:`amdgpu-target-features-table` that
644 is supported by the processor. The target features supported by each processor
are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
646 a target ID are marked as being controlled by ``-mcpu`` and
647 ``--offload-arch``. Each target feature must appear at most once in a target
648 ID. The non-canonical form target ID allows the target features to be
649 specified in any order. The canonical form target ID requires the target
650 features to be specified in alphabetic order.
652 .. _amdgpu-target-id-v2-v3:
654 Code Object V2 to V3 Target ID
655 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
657 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
658 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
659 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
660 directive and the bundle entry ID. In those cases it has the following BNF
665 <target-id> ::== <processor> ( "+" <target-feature> )*
667 Where a target feature is omitted if *Off* and present if *On* or *Any*.
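
For example, under this syntax ``gfx902`` with ``xnack`` enabled (or *Any*) is
written ``gfx902+xnack``, while ``gfx902`` with ``xnack`` disabled is written
simply ``gfx902``.
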
The code object V2 to V3 cannot represent *Any* and treats it the same as *On*.
674 .. _amdgpu-embedding-bundled-objects:
676 Embedding Bundled Code Objects
677 ------------------------------
679 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
680 as described in `Clang Offload Bundler
681 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
685 The target ID syntax used for code object V2 to V3 for a bundle entry ID
686 differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
688 .. _amdgpu-address-spaces:
693 The AMDGPU architecture supports a number of memory address spaces. The address
694 space names use the OpenCL standard names, with some additions.
696 The AMDGPU address spaces correspond to target architecture specific LLVM
697 address space numbers used in LLVM IR.
699 The AMDGPU address spaces are described in
700 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
701 supported for the ``amdgcn`` target.
703 .. table:: AMDGPU Address Spaces
704 :name: amdgpu-address-spaces-table
706 ===================================== =============== =========== ================ ======= ============================
707 .. 64-Bit Process Address Space
708 ------------------------------------- --------------- ----------- ---------------- ------------------------------------
709 Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value
710 Space Number Name Name Size
711 ===================================== =============== =========== ================ ======= ============================
712 Generic 0 flat flat 64 0x0000000000000000
713 Global 1 global global 64 0x0000000000000000
714 Region 2 N/A GDS 32 *not implemented for AMDHSA*
715 Local 3 group LDS 32 0xFFFFFFFF
716 Constant 4 constant *same as global* 64 0x0000000000000000
717 Private 5 private scratch 32 0xFFFFFFFF
718 Constant 32-bit 6 *TODO* 0x00000000
719 Buffer Fat Pointer (experimental) 7 *TODO*
720 Buffer Resource (experimental) 8 *TODO*
721 Buffer Strided Pointer (experimental) 9 *TODO*
722 Streamout Registers 128 N/A GS_REGS
723 ===================================== =============== =========== ================ ======= ============================
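
As an illustration of how these address space numbers appear in LLVM IR, the
following hedged sketch (all names are illustrative) uses a local (LDS)
variable, a private ``alloca``, and a global kernel argument:

.. code-block:: llvm

  ; Address space 3 is local (LDS), 5 is private (scratch), and 1 is global,
  ; matching the numbers in the table above.
  @lds_buf = internal addrspace(3) global [64 x i32] undef, align 4

  define amdgpu_kernel void @addrspace_example(ptr addrspace(1) %out) {
  entry:
    %tmp = alloca i32, align 4, addrspace(5)   ; private (scratch) allocation
    store i32 42, ptr addrspace(5) %tmp
    %p = load i32, ptr addrspace(5) %tmp
    %l = load i32, ptr addrspace(3) @lds_buf   ; local (LDS) access
    %sum = add i32 %p, %l
    store i32 %sum, ptr addrspace(1) %out      ; global access
    ret void
  }
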
726 The generic address space is supported unless the *Target Properties* column
727 of :ref:`amdgpu-processor-table` specifies *Does not support generic address
The generic address space uses the hardware flat address support for two fixed
ranges of virtual addresses (the private and local apertures) that are
outside the range of addressable global memory, to map from a flat address to
a private or local address. This uses FLAT instructions that can take a flat
address and access global, private (scratch), and group (LDS) memory depending
on whether the address is within one of the aperture ranges.
737 Flat access to scratch requires hardware aperture setup and setup in the
738 kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
739 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
740 setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
742 To convert between a private or group address space address (termed a segment
address) and a flat address, the base address of the corresponding aperture
can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
746 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
747 GFX9-GFX11 the aperture base addresses are directly available as inline
748 constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
749 In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
750 aligned to 2^32 which makes it easier to convert from flat to segment or
753 A global address space address has the same value when used as a flat address
754 so no conversion is needed.
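
A minimal sketch of such a conversion in LLVM IR, assuming a kernel that
receives an LDS pointer argument (the names are illustrative), is:

.. code-block:: llvm

  define amdgpu_kernel void @generic_example(ptr addrspace(3) %lds) {
  entry:
    ; Convert a local (segment) address to a generic (flat) address. The store
    ; through the generic pointer is resolved by the hardware to LDS because
    ; the flat address falls within the local aperture.
    %flat = addrspacecast ptr addrspace(3) %lds to ptr
    store i32 1, ptr %flat
    ret void
  }
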
756 **Global and Constant**
757 The global and constant address spaces both use global virtual addresses,
758 which are the same virtual address space used by the CPU. However, some
759 virtual addresses may only be accessible to the CPU, some only accessible
760 by the GPU, and some by both.
762 Using the constant address space indicates that the data will not change
763 during the execution of the kernel. This allows scalar read instructions to
be used. As the constant address space can only be modified on the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only the device global memory is visible
767 and managed on the host side. The vector and scalar L1 caches are invalidated
768 of volatile data before each kernel dispatch execution to allow constant
769 memory to change values between kernel dispatches.
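
A hedged sketch of a constant address space access in LLVM IR (the kernel
signature is illustrative) is:

.. code-block:: llvm

  define amdgpu_kernel void @constant_example(ptr addrspace(4) %in, ptr addrspace(1) %out) {
  entry:
    ; A load from the constant address space (4) is uniform for the dispatch
    ; and is eligible for selection to a scalar load.
    %v = load i32, ptr addrspace(4) %in, align 4
    store i32 %v, ptr addrspace(1) %out
    ret void
  }
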
772 The region address space uses the hardware Global Data Store (GDS). All
773 wavefronts executing on the same device will access the same memory for any
774 given region address. However, the same region address accessed by wavefronts
775 executing on different devices will access different memory. It is higher
776 performance than global memory. It is allocated by the runtime. The data
777 store (DS) instructions can be used to access it.
780 The local address space uses the hardware Local Data Store (LDS) which is
781 automatically allocated when the hardware creates the wavefronts of a
782 work-group, and freed when all the wavefronts of a work-group have
783 terminated. All wavefronts belonging to the same work-group will access the
784 same memory for any given local address. However, the same local address
785 accessed by wavefronts belonging to different work-groups will access
786 different memory. It is higher performance than global memory. The data store
787 (DS) instructions can be used to access it.
790 The private address space uses the hardware scratch memory support which
791 automatically allocates memory when it creates a wavefront and frees it when
a wavefront terminates. The memory accessed by a lane of a wavefront for any
793 given private address will be different to the memory accessed by another lane
794 of the same or different wavefront for the same private address.
796 If a kernel dispatch uses scratch, then the hardware allocates memory from a
797 pool of backing memory allocated by the runtime for each wavefront. The lanes
798 of the wavefront access this using dword (4 byte) interleaving. The mapping
799 used from private address to backing memory address is:
801 ``wavefront-scratch-base +
802 ((private-address / 4) * wavefront-size * 4) +
803 (wavefront-lane-id * 4) + (private-address % 4)``
805 If each lane of a wavefront accesses the same private address, the
806 interleaving results in adjacent dwords being accessed and hence requires
807 fewer cache lines to be fetched.
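
For example, with a wavefront size of 64, lane 5 accessing private address 8
maps to ``wavefront-scratch-base + (8 / 4) * 64 * 4 + 5 * 4 + 0``, an offset of
532 bytes from the wavefront scratch base, while lane 6 accessing the same
private address maps to offset 536, the adjacent dword.
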
809 There are different ways that the wavefront scratch base address is
810 determined by a wavefront (see
811 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
813 Scratch memory can be accessed in an interleaved manner using buffer
814 instructions with the scratch buffer descriptor and per wavefront scratch
815 offset, by the scratch instructions, or by flat instructions. Multi-dword
816 access is not supported except by flat and scratch instructions in
819 Code that manipulates the stack values in other lanes of a wavefront,
820 such as by ``addrspacecast``-ing stack pointers to generic ones and taking offsets
821 that reach other lanes or by explicitly constructing the scratch buffer descriptor,
822 triggers undefined behavior when it modifies the scratch values of other lanes.
823 The compiler may assume that such modifications do not occur.
When using code object V5, ``LIBOMPTARGET_STACK_SIZE`` may be used to provide the
825 private segment size in bytes, for cases where a dynamic stack is used.
830 **Buffer Fat Pointer**
831 The buffer fat pointer is an experimental address space that is currently
832 unsupported in the backend. It exposes a non-integral pointer that is in
833 the future intended to support the modelling of 128-bit buffer descriptors
834 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
835 *pointer*), allowing normal LLVM load/store/atomic operations to be used to
836 model the buffer descriptors used heavily in graphics workloads targeting
839 The buffer descriptor used to construct a buffer fat pointer must be *raw*:
840 the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
841 must be off, and the extent must be measured in bytes. (On subtargets where
842 bounds checking may be disabled, buffer fat pointers may choose to enable
846 The buffer resource pointer, in address space 8, is the newer form
847 for representing buffer descriptors in AMDGPU IR, replacing their
848 previous representation as `<4 x i32>`. It is a non-integral pointer
849 that represents a 128-bit buffer descriptor resource (`V#`).
851 Since, in general, a buffer resource supports complex addressing modes that cannot
852 be easily represented in LLVM (such as implicit swizzled access to structured
853 buffers), it is **illegal** to perform non-trivial address computations, such as
854 ``getelementptr`` operations, on buffer resources. They may be passed to
855 AMDGPU buffer intrinsics, and they may be converted to and from ``i128``.
Casting a buffer resource to a buffer fat pointer is permitted and adds an offset of 0.
860 Buffer resources can be created from 64-bit pointers (which should be either
861 generic or global) using the `llvm.amdgcn.make.buffer.rsrc` intrinsic, which
862 takes the pointer, which becomes the base of the resource,
the 16-bit stride (and swizzle control) field stored in bits `63:48` of a `V#`,
864 the 32-bit NumRecords/extent field (bits `95:64`), and the 32-bit flags field
865 (bits `127:96`). The specific interpretation of these fields varies by the
866 target architecture and is detailed in the ISA descriptions.
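
A hedged sketch of constructing and using such a resource in LLVM IR follows.
The exact intrinsic name mangling (``.p0`` here for a flat base pointer) and
the field values are assumptions that may vary between LLVM versions and
targets:

.. code-block:: llvm

  declare ptr addrspace(8) @llvm.amdgcn.make.buffer.rsrc.p0(ptr, i16, i32, i32)
  declare i32 @llvm.amdgcn.raw.ptr.buffer.load.i32(ptr addrspace(8), i32, i32, i32 immarg)

  define amdgpu_kernel void @buffer_rsrc_example(ptr %base, ptr addrspace(1) %out) {
  entry:
    ; Raw descriptor: stride/swizzle field 0, extent of 1024 bytes, flags 0.
    ; The flags value is a placeholder, not a recommendation.
    %rsrc = call ptr addrspace(8) @llvm.amdgcn.make.buffer.rsrc.p0(ptr %base, i16 0, i32 1024, i32 0)
    ; Load the dword at byte offset 16 through the resource.
    %v = call i32 @llvm.amdgcn.raw.ptr.buffer.load.i32(ptr addrspace(8) %rsrc, i32 16, i32 0, i32 0)
    store i32 %v, ptr addrspace(1) %out
    ret void
  }
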
868 **Buffer Strided Pointer**
The buffer strided pointer is an experimental address space. It represents
870 a 128-bit buffer descriptor and a 32-bit offset, like the **Buffer Fat
871 Pointer**. Additionally, it contains an index into the buffer, which
872 allows the direct addressing of structured elements. These components appear
873 in that order, i.e., the descriptor comes first, then the 32-bit offset
874 followed by the 32-bit index.
876 The bits in the buffer descriptor must meet the following requirements:
877 the stride is the size of a structured element, the "add tid" flag must be 0,
878 and the swizzle enable bits must be off.
880 **Streamout Registers**
881 Dedicated registers used by the GS NGG Streamout Instructions. The register
882 file is modelled as a memory in a distinct address space because it is indexed
883 by an address-like offset in place of named registers, and because register
884 accesses affect LGKMcnt. This is an internal address space used only by the
885 compiler. Do not use this address space for IR pointers.
887 .. _amdgpu-memory-scopes:
892 This section provides LLVM memory synchronization scopes supported by the AMDGPU
893 backend memory model when the target triple OS is ``amdhsa`` (see
894 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
896 The memory model supported is based on the HSA memory model [HSA]_ which is
897 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
898 relation is transitive over the synchronizes-with relation independent of scope
899 and synchronizes-with allows the memory scope instances to be inclusive (see
900 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
902 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
903 inclusion and requires the memory scopes to exactly match. However, this
904 is conservatively correct for OpenCL.
906 .. table:: AMDHSA LLVM Sync Scopes
907 :name: amdgpu-amdhsa-llvm-sync-scopes-table
909 ======================= ===================================================
910 LLVM Sync Scope Description
911 ======================= ===================================================
912 *none* The default: ``system``.
914 Synchronizes with, and participates in modification
915 and seq_cst total orderings with, other operations
916 (except image operations) for all address spaces
917 (except private, or generic that accesses private)
918 provided the other operation's sync scope is:
921 - ``agent`` and executed by a thread on the same
923 - ``workgroup`` and executed by a thread in the
925 - ``wavefront`` and executed by a thread in the
928 ``agent`` Synchronizes with, and participates in modification
929 and seq_cst total orderings with, other operations
930 (except image operations) for all address spaces
931 (except private, or generic that accesses private)
932 provided the other operation's sync scope is:
934 - ``system`` or ``agent`` and executed by a thread
936 - ``workgroup`` and executed by a thread in the
938 - ``wavefront`` and executed by a thread in the
941 ``workgroup`` Synchronizes with, and participates in modification
942 and seq_cst total orderings with, other operations
943 (except image operations) for all address spaces
944 (except private, or generic that accesses private)
945 provided the other operation's sync scope is:
947 - ``system``, ``agent`` or ``workgroup`` and
948 executed by a thread in the same work-group.
949 - ``wavefront`` and executed by a thread in the
952 ``wavefront`` Synchronizes with, and participates in modification
953 and seq_cst total orderings with, other operations
954 (except image operations) for all address spaces
955 (except private, or generic that accesses private)
956 provided the other operation's sync scope is:
958 - ``system``, ``agent``, ``workgroup`` or
959 ``wavefront`` and executed by a thread in the
962 ``singlethread`` Only synchronizes with and participates in
963 modification and seq_cst total orderings with,
964 other operations (except image operations) running
965 in the same thread for all address spaces (for
966 example, in signal handlers).
968 ``one-as`` Same as ``system`` but only synchronizes with other
969 operations within the same address space.
971 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
972 operations within the same address space.
974 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
975 other operations within the same address space.
977 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
978 other operations within the same address space.
980 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
981 other operations within the same address space.
982 ======================= ===================================================
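
As a brief illustration, the following LLVM IR sketch (names are illustrative)
uses an agent-scope atomic together with a work-group-scope fence:

.. code-block:: llvm

  define amdgpu_kernel void @scope_example(ptr addrspace(1) %counter) {
  entry:
    ; Increment a global counter with agent (device) scope.
    %old = atomicrmw add ptr addrspace(1) %counter, i32 1 syncscope("agent") monotonic, align 4
    ; Make prior writes visible to other waves in the same work-group.
    fence syncscope("workgroup") release
    ret void
  }
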
987 The AMDGPU backend implements the following LLVM IR intrinsics.
989 *This section is WIP.*
991 .. table:: AMDGPU LLVM IR Intrinsics
992 :name: amdgpu-llvm-ir-intrinsics-table
994 ============================================== ==========================================================
995 LLVM Intrinsic Description
996 ============================================== ==========================================================
997 llvm.amdgcn.sqrt Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16
998 (on targets with half support). Performs sqrt function.
1000 llvm.amdgcn.log Provides direct access to v_log_f32 and v_log_f16
1001 (on targets with half support). Performs log2 function.
1003 llvm.amdgcn.exp2 Provides direct access to v_exp_f32 and v_exp_f16
1004 (on targets with half support). Performs exp2 function.
1006 :ref:`llvm.frexp <int_frexp>` Implemented for half, float and double.
1008 :ref:`llvm.log2 <int_log2>` Implemented for float and half (and vectors of float or
1009 half). Not implemented for double. Hardware provides
1010 1ULP accuracy for float, and 0.51ULP for half. Float
1011 instruction does not natively support denormal
1014 :ref:`llvm.sqrt <int_sqrt>` Implemented for double, float and half (and vectors).
1016 :ref:`llvm.log <int_log>` Implemented for float and half (and vectors).
1018 :ref:`llvm.exp <int_exp>` Implemented for float and half (and vectors).
1020 :ref:`llvm.log10 <int_log10>` Implemented for float and half (and vectors).
1022 :ref:`llvm.exp2 <int_exp2>` Implemented for float and half (and vectors of float or
1023 half). Not implemented for double. Hardware provides
1024 1ULP accuracy for float, and 0.51ULP for half. Float
1025 instruction does not natively support denormal
1028 :ref:`llvm.stacksave.p5 <int_stacksave>` Implemented, must use the alloca address space.
1029 :ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space.
1031 :ref:`llvm.get.fpmode.i32 <int_get_fpmode>` The natural floating-point mode type is i32. This
is implemented by extracting relevant bits out of the MODE
1033 register with s_getreg_b32. The first 10 bits are the
1034 core floating-point mode. Bits 12:18 are the exception
1035 mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
1036 relevant to floating-point instructions are 0s.
1038 :ref:`llvm.get.rounding<int_get_rounding>` AMDGPU supports two separately controllable rounding
1039 modes depending on the floating-point type. One
1040 controls float, and the other controls both double and
1041 half operations. If both modes are the same, returns
1042 one of the standard return values. If the modes are
1043 different, returns one of :ref:`12 extended values
1044 <amdgpu-rounding-mode-enumeration-values-table>`
1045 describing the two modes.
1047 To nearest, ties away from zero is not a supported
1048 mode. The raw rounding mode values in the MODE
1049 register do not exactly match the FLT_ROUNDS values,
1050 so a conversion is performed.
1052 llvm.amdgcn.wave.reduce.umin Performs an arithmetic unsigned min reduction on the unsigned values
1053 provided by each lane in the wavefront.
1054 Intrinsic takes a hint for reduction strategy using second operand
1055 0: Target default preference,
1056 1: `Iterative strategy`, and
1058 If target does not support the DPP operations (e.g. gfx6/7),
1059 reduction will be performed using default iterative strategy.
1060 Intrinsic is currently only implemented for i32.
1062 llvm.amdgcn.wave.reduce.umax Performs an arithmetic unsigned max reduction on the unsigned values
1063 provided by each lane in the wavefront.
1064 Intrinsic takes a hint for reduction strategy using second operand
1065 0: Target default preference,
1066 1: `Iterative strategy`, and
1068 If target does not support the DPP operations (e.g. gfx6/7),
1069 reduction will be performed using default iterative strategy.
1070 Intrinsic is currently only implemented for i32.
1072 llvm.amdgcn.udot2 Provides direct access to v_dot2_u32_u16 across targets which
1073 support such instructions. This performs unsigned dot product
1074 with two v2i16 operands, summed with the third i32 operand. The
1075 i1 fourth operand is used to clamp the output.
1077 llvm.amdgcn.udot4 Provides direct access to v_dot4_u32_u8 across targets which
1078 support such instructions. This performs unsigned dot product
1079 with two i32 operands (holding a vector of 4 8bit values), summed
1080 with the third i32 operand. The i1 fourth operand is used to clamp
1083 llvm.amdgcn.udot8 Provides direct access to v_dot8_u32_u4 across targets which
1084 support such instructions. This performs unsigned dot product
1085 with two i32 operands (holding a vector of 8 4bit values), summed
1086 with the third i32 operand. The i1 fourth operand is used to clamp
1089 llvm.amdgcn.sdot2 Provides direct access to v_dot2_i32_i16 across targets which
1090 support such instructions. This performs signed dot product
1091 with two v2i16 operands, summed with the third i32 operand. The
1092 i1 fourth operand is used to clamp the output.
1093 When applicable (e.g. no clamping), this is lowered into
1094 v_dot2c_i32_i16 for targets which support it.
1096 llvm.amdgcn.sdot4 Provides direct access to v_dot4_i32_i8 across targets which
1097 support such instructions. This performs signed dot product
1098 with two i32 operands (holding a vector of 4 8bit values), summed
1099 with the third i32 operand. The i1 fourth operand is used to clamp
1101 When applicable (i.e. no clamping / operand modifiers), this is lowered
1102 into v_dot4c_i32_i8 for targets which support it.
1103 RDNA3 does not offer v_dot4_i32_i8, and rather offers
1104 v_dot4_i32_iu8 which has operands to hold the signedness of the
1105 vector operands. Thus, this intrinsic lowers to the signed version
1106 of this instruction for gfx11 targets.
llvm.amdgcn.sdot8 Provides direct access to v_dot8_i32_i4 across targets which
1109 support such instructions. This performs signed dot product
1110 with two i32 operands (holding a vector of 8 4bit values), summed
1111 with the third i32 operand. The i1 fourth operand is used to clamp
1113 When applicable (i.e. no clamping / operand modifiers), this is lowered
1114 into v_dot8c_i32_i4 for targets which support it.
1115 RDNA3 does not offer v_dot8_i32_i4, and rather offers
v_dot8_i32_iu4 which has operands to hold the signedness of the
1117 vector operands. Thus, this intrinsic lowers to the signed version
1118 of this instruction for gfx11 targets.
1120 llvm.amdgcn.sudot4 Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs
1121 dot product with two i32 operands (holding a vector of 4 8bit values), summed
1122 with the fifth i32 operand. The i1 sixth operand is used to clamp
1123 the output. The i1s preceding the vector operands decide the signedness.
1125 llvm.amdgcn.sudot8 Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs
1126 dot product with two i32 operands (holding a vector of 8 4bit values), summed
1127 with the fifth i32 operand. The i1 sixth operand is used to clamp
1128 the output. The i1s preceding the vector operands decide the signedness.
1130 llvm.amdgcn.sched_barrier Controls the types of instructions that may be allowed to cross the intrinsic
1131 during instruction scheduling. The parameter is a mask for the instruction types
1132 that can cross the intrinsic.
1134 - 0x0000: No instructions may be scheduled across sched_barrier.
1135 - 0x0001: All, non-memory, non-side-effect producing instructions may be
1136 scheduled across sched_barrier, *i.e.* allow ALU instructions to pass.
1137 - 0x0002: VALU instructions may be scheduled across sched_barrier.
1138 - 0x0004: SALU instructions may be scheduled across sched_barrier.
1139 - 0x0008: MFMA/WMMA instructions may be scheduled across sched_barrier.
1140 - 0x0010: All VMEM instructions may be scheduled across sched_barrier.
1141 - 0x0020: VMEM read instructions may be scheduled across sched_barrier.
1142 - 0x0040: VMEM write instructions may be scheduled across sched_barrier.
1143 - 0x0080: All DS instructions may be scheduled across sched_barrier.
- 0x0100: All DS read instructions may be scheduled across sched_barrier.
1145 - 0x0200: All DS write instructions may be scheduled across sched_barrier.
1147 llvm.amdgcn.sched_group_barrier Creates schedule groups with specific properties to create custom scheduling
1148 pipelines. The ordering between groups is enforced by the instruction scheduler.
The intrinsic applies to the code that precedes the intrinsic. The intrinsic
1150 takes three values that control the behavior of the schedule groups.
1152 - Mask : Classify instruction groups using the llvm.amdgcn.sched_barrier mask values.
1153 - Size : The number of instructions that are in the group.
1154 - SyncID : Order is enforced between groups with matching values.
1156 The mask can include multiple instruction types. It is undefined behavior to set
1157 values beyond the range of valid masks.
1159 Combining multiple sched_group_barrier intrinsics enables an ordering of specific
1160 instruction types during instruction scheduling. For example, the following enforces
1161 a sequence of 1 VMEM read, followed by 1 VALU instruction, followed by 5 MFMA
1164 | ``// 1 VMEM read``
1165 | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0)``
1167 | ``__builtin_amdgcn_sched_group_barrier(2, 1, 0)``
1169 | ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
1171 llvm.amdgcn.iglp_opt An **experimental** intrinsic for instruction group level parallelism. The intrinsic
implements predefined instruction scheduling orderings. The intrinsic applies to the
1173 surrounding scheduling region. The intrinsic takes a value that specifies the
1174 strategy. The compiler implements two strategies.
1176 0. Interleave DS and MFMA instructions for small GEMM kernels.
1177 1. Interleave DS and MFMA instructions for single wave small GEMM kernels.
1179 Only one iglp_opt intrinsic may be used in a scheduling region. The iglp_opt intrinsic
1180 cannot be combined with sched_barrier or sched_group_barrier.
1182 The iglp_opt strategy implementations are subject to change.
1184 ============================================== ==========================================================
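
As a small illustration of how the target-specific and generic intrinsics
listed above are used from LLVM IR (the kernel itself is illustrative):

.. code-block:: llvm

  declare float @llvm.amdgcn.sqrt.f32(float)
  declare float @llvm.sqrt.f32(float)

  define amdgpu_kernel void @sqrt_example(ptr addrspace(1) %out, float %x) {
  entry:
    %a = call float @llvm.amdgcn.sqrt.f32(float %x) ; direct access to v_sqrt_f32
    %b = call float @llvm.sqrt.f32(float %x)        ; generic llvm.sqrt lowering
    %s = fadd float %a, %b
    store float %s, ptr addrspace(1) %out
    ret void
  }
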
1188 List AMDGPU intrinsics.
1193 The AMDGPU backend supports the following LLVM IR attributes.
1195 .. table:: AMDGPU LLVM IR Attributes
1196 :name: amdgpu-llvm-ir-attributes-table
1198 ======================================= ==========================================================
1199 LLVM Attribute Description
1200 ======================================= ==========================================================
1201 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
1202 will be specified when the kernel is dispatched. Generated
1203 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
1204 The IR implied default value is 1,1024. Clang may emit this attribute
1205 with more restrictive bounds depending on language defaults.
1206 If the actual block or workgroup size exceeds the limit at any point during
1207 the execution, the behavior is undefined. For example, even if there is
1208 only one active thread but the thread local id exceeds the limit, the
1209 behavior is undefined.
1211 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
1212 argument block size for the implicit arguments. This
1213 varies by OS and language (for OpenCL see
1214 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
1215 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
1216 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
1217 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
1218 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
1219 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
1220 execution unit. Generated by the ``amdgpu_waves_per_eu``
1221 CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
1222 and the backend may not be able to satisfy the request. If
1223 the specified range is incompatible with the function's
1224 "amdgpu-flat-work-group-size" value, the implied occupancy
1225 bounds by the workgroup size takes precedence.
1227 "amdgpu-ieee" true/false. GFX6-GFX11 Only
1228 Specify whether the function expects the IEEE field of the
1229 mode register to be set on entry. Overrides the default for
1230 the calling convention.
1231 "amdgpu-dx10-clamp" true/false. GFX6-GFX11 Only
1232 Specify whether the function expects the DX10_CLAMP field of
1233 the mode register to be set on entry. Overrides the default
1234 for the calling convention.
1236 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
1237 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
1238 attribute, or reached through a call site marked with this attribute,
1239 the value returned by the intrinsic is undefined. The backend can
1240 generally infer this during code generation, so typically there is no
1241 benefit to frontends marking functions with this.
1243 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
1244 llvm.amdgcn.workitem.id.y intrinsic.
1246 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
1247 llvm.amdgcn.workitem.id.z intrinsic.
1249 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
1250 llvm.amdgcn.workgroup.id.x intrinsic.
1252 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
1253 llvm.amdgcn.workgroup.id.y intrinsic.
1255 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
1256 llvm.amdgcn.workgroup.id.z intrinsic.
1258 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
1259 llvm.amdgcn.dispatch.ptr intrinsic.
1261 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
1262 llvm.amdgcn.implicitarg.ptr intrinsic.
1264 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
1265 llvm.amdgcn.dispatch.id intrinsic.
1267 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
1268 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
1269 attributes, the queue pointer may be required in situations where the
1270 intrinsic call does not directly appear in the program. Some subtargets
require the queue pointer to handle some addrspacecasts, as well
1272 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
llvm.debugtrap intrinsics.
1275 "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1276 kernel argument that holds the pointer to the hostcall buffer. If this
1277 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1279 "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1280 kernel argument that holds the pointer to an initialized memory buffer
1281 that conforms to the requirements of the malloc/free device library V1
1282 version implementation. If this attribute is absent, then the
1283 amdgpu-no-implicitarg-ptr is also removed.
1285 "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1286 kernel argument that holds the multigrid synchronization pointer. If this
1287 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1289 "amdgpu-no-default-queue" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1290 kernel argument that holds the default queue pointer. If this
1291 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1293 "amdgpu-no-completion-action" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1294 kernel argument that holds the completion action pointer. If this
1295 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1297 "amdgpu-lds-size"="min[,max]" Min is the minimum number of bytes that will be allocated in the Local
1298 Data Store at address zero. Variables are allocated within this frame
1299 using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS
1300 pass. Optional max is the maximum number of bytes that will be allocated.
1301 Note that min==max indicates that no further variables can be added to
1302 the frame. This is an internal detail of how LDS variables are lowered,
1303 language front ends should not set this attribute.
1305 ======================================= ==========================================================
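
For example, a kernel can carry these attributes in the usual LLVM IR
attribute-group form; the specific bounds below are placeholders, not
recommendations:

.. code-block:: llvm

  define amdgpu_kernel void @attr_example(ptr addrspace(1) %out) #0 {
  entry:
    store i32 0, ptr addrspace(1) %out
    ret void
  }

  ; Dispatches of @attr_example must use between 64 and 256 work-items per
  ; work-group, and the backend is asked to target 2 to 4 waves per EU.
  attributes #0 = { "amdgpu-flat-work-group-size"="64,256" "amdgpu-waves-per-eu"="2,4" }
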
1310 The AMDGPU backend supports the following calling conventions:
1312 .. table:: AMDGPU Calling Conventions
1315 =============================== ==========================================================
1316 Calling Convention Description
1317 =============================== ==========================================================
1318 ``ccc`` The C calling convention. Used by default.
1319 See :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`
1322 ``fastcc`` The fast calling convention. Mostly the same as the ``ccc``.
1324 ``coldcc`` The cold calling convention. Mostly the same as the ``ccc``.
1326 ``amdgpu_cs`` Used for Mesa/AMDPAL compute shaders.
1330 ``amdgpu_cs_chain`` Similar to ``amdgpu_cs``, with differences described below.
1332 Functions with this calling convention cannot be called directly. They must
1333 instead be launched via the ``llvm.amdgcn.cs.chain`` intrinsic.
1335 Arguments are passed in SGPRs, starting at s0, if they have the ``inreg``
1336 attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs
1337 than available in the subtarget is not allowed. On subtargets that use
1338 a scratch buffer descriptor (as opposed to ``scratch_{load,store}_*`` instructions),
1339 the scratch buffer descriptor is passed in s[48:51]. This limits the
1340 SGPR / ``inreg`` arguments to the equivalent of 48 dwords; using more
1341 than that is not allowed.
1343 The return type must be void.
1344 Varargs, sret, byval, byref, inalloca, preallocated are not supported.
1346 Values in scalar registers as well as v0-v7 are not preserved. Values in
1347 VGPRs starting at v8 are not preserved for the active lanes, but must be
1348 saved by the callee for inactive lanes when using WWM.
1350 Wave scratch is "empty" at function boundaries. There is no stack pointer input
1351 or output value, but functions are free to use scratch starting from an initial
1352 stack pointer. Calls to ``amdgpu_gfx`` functions are allowed and behave like they
1353 do in ``amdgpu_cs`` functions.
1355 All counters (``lgkmcnt``, ``vmcnt``, ``storecnt``, etc.) are presumed in an
1356 unknown state at function entry.
1358 A function may have multiple exits (e.g. one chain exit and one plain ``ret void``
1359 for when the wave ends), but all ``llvm.amdgcn.cs.chain`` exits must be in
1360 uniform control flow.
1362 ``amdgpu_cs_chain_preserve`` Same as ``amdgpu_cs_chain``, but active lanes for VGPRs starting at v8 are preserved.
1363 Calls to ``amdgpu_gfx`` functions are not allowed, and any calls to ``llvm.amdgcn.cs.chain``
1364 must not pass more VGPR arguments than the caller's VGPR function parameters.
1366 ``amdgpu_es`` Used for AMDPAL shader stage before geometry shader if geometry is in
1367 use. So either the domain (= tessellation evaluation) shader if
1368 tessellation is in use, or otherwise the vertex shader.
1372 ``amdgpu_gfx`` Used for AMD graphics targets. Functions with this calling convention
1373 cannot be used as entry points.
1377 ``amdgpu_gs`` Used for Mesa/AMDPAL geometry shaders.
1381 ``amdgpu_hs`` Used for Mesa/AMDPAL hull shaders (= tessellation control shaders).
1385 ``amdgpu_kernel`` See :ref:`amdgpu-amdhsa-function-call-convention-kernel-functions`
1387 ``amdgpu_ls`` Used for AMDPAL vertex shader if tessellation is in use.
1391 ``amdgpu_ps`` Used for Mesa/AMDPAL pixel shaders.
1395 ``amdgpu_vs`` Used for Mesa/AMDPAL last shader stage before rasterization (vertex
1396 shader if tessellation and geometry are not in use, or otherwise
1397 copy shader if one is needed).
1401 =============================== ==========================================================
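
A minimal sketch of the two most common cases, an ``amdgpu_kernel`` entry
point calling an ordinary function that uses the default C calling convention
(names are illustrative):

.. code-block:: llvm

  ; Non-kernel function; uses the default C calling convention (ccc).
  define internal i32 @triple_it(i32 %x) {
  entry:
    %y = mul i32 %x, 3
    ret i32 %y
  }

  ; Kernel entry point; dispatched by the runtime rather than called directly.
  define amdgpu_kernel void @cc_example(ptr addrspace(1) %out, i32 %x) {
  entry:
    %r = call i32 @triple_it(i32 %x)
    store i32 %r, ptr addrspace(1) %out
    ret void
  }
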
1404 .. _amdgpu-elf-code-object:
1409 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1410 can be linked by ``lld`` to produce a standard ELF shared code object which can
1411 be loaded and executed on an AMDGPU target.
1413 .. _amdgpu-elf-header:
1418 The AMDGPU backend uses the following ELF header:
1420 .. table:: AMDGPU ELF Header
1421 :name: amdgpu-elf-header-table
1423 ========================== ===============================
1425 ========================== ===============================
1426 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
1427 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
1428 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
1429 - ``ELFOSABI_AMDGPU_HSA``
1430 - ``ELFOSABI_AMDGPU_PAL``
1431 - ``ELFOSABI_AMDGPU_MESA3D``
1432 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1433 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1434 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1435 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1436 - ``ELFABIVERSION_AMDGPU_PAL``
1437 - ``ELFABIVERSION_AMDGPU_MESA3D``
1438 ``e_type`` - ``ET_REL``
1440 ``e_machine`` ``EM_AMDGPU``
1442 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1443 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1444 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1445 ========================== ===============================
1449 .. table:: AMDGPU ELF Header Enumeration Values
1450 :name: amdgpu-elf-header-enumeration-values-table
1452 =============================== =====
1454 =============================== =====
1457 ``ELFOSABI_AMDGPU_HSA`` 64
1458 ``ELFOSABI_AMDGPU_PAL`` 65
1459 ``ELFOSABI_AMDGPU_MESA3D`` 66
1460 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1461 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1462 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1463 ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1464 ``ELFABIVERSION_AMDGPU_PAL`` 0
1465 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1466 =============================== =====
1468 ``e_ident[EI_CLASS]``
1471 * ``ELFCLASS32`` for ``r600`` architecture.
1473 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1474 process address space applications.
1476 ``e_ident[EI_DATA]``
1477 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1479 ``e_ident[EI_OSABI]``
1480 One of the following AMDGPU target architecture specific OS ABIs
1481 (see :ref:`amdgpu-os`):
1483 * ``ELFOSABI_NONE`` for *unknown* OS.
1485 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1487 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1491 ``e_ident[EI_ABIVERSION]``
1492 The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1495 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1496 runtime ABI for code object V2. Can no longer be emitted by this version of LLVM.
1498 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1499 runtime ABI for code object V3. Can no longer be emitted by this version of LLVM.
1501 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1502 runtime ABI for code object V4. Specify using the Clang option
1503 ``-mcode-object-version=4``. This is the default code object
1504 version if not specified.
1506 * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1507 runtime ABI for code object V5. Specify using the Clang option
1508 ``-mcode-object-version=5``.
* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
runtime ABI.
* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
3D runtime ABI.
``e_type``
Can be one of the following values:

``ET_REL``
The type produced by the AMDGPU backend compiler as it is a relocatable code
object.

``ET_DYN``
The type produced by the linker as it is a shared code object.

The AMD HSA runtime loader requires an ``ET_DYN`` code object.
``e_machine``
The value ``EM_AMDGPU`` is used as the machine value for all processors supported
1531 by the ``r600`` and ``amdgcn`` architectures (see
1532 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1533 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1534 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1535 ``e_flags`` for code object V3 and above (see
1536 :ref:`amdgpu-elf-header-e_flags-table-v3` and
1537 :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
``e_entry``
The entry point is 0 as the entry points for individual kernels must be
1541 selected in order to invoke them through AQL packets.
``e_flags``
The AMDGPU backend uses the following ELF header flags:
1546 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1547 :name: amdgpu-elf-header-e_flags-v2-table
1549 ===================================== ===== =============================
1550 Name Value Description
1551 ===================================== ===== =============================
``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack`` target feature is
enabled for all code contained in the code object. If the processor does not
support the ``xnack`` target feature then must be 0. See
:ref:`amdgpu-target-features`.
``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap handler is
enabled for all code contained in the code object. If the processor does not
support a trap handler then must be 0. See :ref:`amdgpu-target-features`.
1571 ===================================== ===== =============================
1573 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1574 :name: amdgpu-elf-header-e_flags-table-v3
1576 ================================= ===== =============================
1577 Name Value Description
1578 ================================= ===== =============================
``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection mask for
``EF_AMDGPU_MACH_xxx`` values defined in
:ref:`amdgpu-ef-amdgpu-mach-table`.
``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack`` target feature is
enabled for all code contained in the code object. If the processor does not
support the ``xnack`` target feature then must be 0. See
:ref:`amdgpu-target-features`.
``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc`` target
feature is enabled for all code contained in the code object. If the processor
does not support the ``sramecc`` target feature then must be 0. See
:ref:`amdgpu-target-features`.
1606 ================================= ===== =============================
1608 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1609 :name: amdgpu-elf-header-e_flags-table-v4-onwards
1611 ============================================ ===== ===================================
1612 Name Value Description
1613 ============================================ ===== ===================================
``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection mask for
``EF_AMDGPU_MACH_xxx`` values defined in
:ref:`amdgpu-ef-amdgpu-mach-table`.
1619 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
``EF_AMDGPU_FEATURE_XNACK_*_V4`` values.
1622 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1623 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1624 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1625 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1626 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
``EF_AMDGPU_FEATURE_SRAMECC_*_V4`` values.
1629 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1630 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1632 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1633 ============================================ ===== ===================================
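
For example, a tool inspecting a code object V4 or later ELF file can decode
these settings by masking ``e_flags`` with the selection masks above. The
following is a minimal sketch in C; it assumes the standard ``Elf64_Ehdr``
definition from ``<elf.h>`` and defines the ``EF_AMDGPU_*`` constants locally
with the values from the table (the helper name is illustrative only):

.. code:: c

  #include <elf.h>
  #include <stdio.h>

  /* Values from the code object V4 and later e_flags table above. */
  #define EF_AMDGPU_FEATURE_XNACK_V4     0x300
  #define EF_AMDGPU_FEATURE_XNACK_ANY_V4 0x100
  #define EF_AMDGPU_FEATURE_XNACK_OFF_V4 0x200
  #define EF_AMDGPU_FEATURE_XNACK_ON_V4  0x300

  /* Print the xnack setting selected by the e_flags field. The unsupported
     setting is the remaining encoding, 0x000. */
  static void print_xnack_setting(const Elf64_Ehdr *ehdr) {
    switch (ehdr->e_flags & EF_AMDGPU_FEATURE_XNACK_V4) {
    case EF_AMDGPU_FEATURE_XNACK_ANY_V4: puts("xnack: any"); break;
    case EF_AMDGPU_FEATURE_XNACK_OFF_V4: puts("xnack: off"); break;
    case EF_AMDGPU_FEATURE_XNACK_ON_V4:  puts("xnack: on");  break;
    default:                             puts("xnack: unsupported"); break;
    }
  }

The ``sramecc`` setting can be decoded the same way using the
``EF_AMDGPU_FEATURE_SRAMECC_*_V4`` values.
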
1635 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1636 :name: amdgpu-ef-amdgpu-mach-table
1638 ==================================== ========== =============================
1639 Name Value Description (see
1640 :ref:`amdgpu-processor-table`)
1641 ==================================== ========== =============================
1642 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1643 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1644 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1645 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1646 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1647 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1648 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1649 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1650 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1651 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1652 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1653 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1654 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1655 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1656 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1657 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1658 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1659 *reserved* 0x011 - Reserved for ``r600``
1660 0x01f architecture processors.
1661 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1662 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1663 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1664 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1665 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1666 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1667 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1668 *reserved* 0x027 Reserved.
1669 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1670 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1671 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1672 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1673 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1674 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1675 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1676 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1677 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1678 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1679 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1680 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1681 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1682 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1683 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1684 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1685 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1686 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1687 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1688 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1689 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1690 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1691 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1692 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1693 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940``
1694 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100``
1695 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1696 ``EF_AMDGPU_MACH_AMDGCN_GFX1150`` 0x043 ``gfx1150``
1697 ``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103``
1698 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036``
1699 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101``
1700 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102``
1701 ``EF_AMDGPU_MACH_AMDGCN_GFX1200`` 0x048 ``gfx1200``
1702 *reserved* 0x049 Reserved.
1703 ``EF_AMDGPU_MACH_AMDGCN_GFX1151`` 0x04a ``gfx1151``
1704 ``EF_AMDGPU_MACH_AMDGCN_GFX941`` 0x04b ``gfx941``
1705 ``EF_AMDGPU_MACH_AMDGCN_GFX942`` 0x04c ``gfx942``
1706 *reserved* 0x04d Reserved.
1707 ``EF_AMDGPU_MACH_AMDGCN_GFX1201`` 0x04e ``gfx1201``
1708 ==================================== ========== =============================
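
As an illustration, the processor can be recovered from ``e_flags`` by masking
with ``EF_AMDGPU_MACH`` and comparing against the values above. The sketch
below covers only a handful of entries and defines the constants locally with
the table values:

.. code:: c

  #include <stdint.h>

  #define EF_AMDGPU_MACH                0x0ff
  #define EF_AMDGPU_MACH_AMDGCN_GFX900  0x02c
  #define EF_AMDGPU_MACH_AMDGCN_GFX906  0x02f
  #define EF_AMDGPU_MACH_AMDGCN_GFX1030 0x036
  #define EF_AMDGPU_MACH_AMDGCN_GFX90A  0x03f

  /* Map the EF_AMDGPU_MACH field of e_flags to a processor name. */
  static const char *mach_name(uint32_t e_flags) {
    switch (e_flags & EF_AMDGPU_MACH) {
    case EF_AMDGPU_MACH_AMDGCN_GFX900:  return "gfx900";
    case EF_AMDGPU_MACH_AMDGCN_GFX906:  return "gfx906";
    case EF_AMDGPU_MACH_AMDGCN_GFX1030: return "gfx1030";
    case EF_AMDGPU_MACH_AMDGCN_GFX90A:  return "gfx90a";
    default:                            return "unknown";
    }
  }
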
1713 An AMDGPU target ELF code object has the standard ELF sections which include:
1715 .. table:: AMDGPU ELF Sections
1716 :name: amdgpu-elf-sections-table
1718 ================== ================ =================================
1719 Name Type Attributes
1720 ================== ================ =================================
1721 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1722 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1723 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1724 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1725 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1726 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1727 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1728 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1729 ``.note`` ``SHT_NOTE`` *none*
1730 ``.rela``\ *name* ``SHT_RELA`` *none*
1731 ``.rela.dyn`` ``SHT_RELA`` *none*
1732 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1733 ``.shstrtab`` ``SHT_STRTAB`` *none*
1734 ``.strtab`` ``SHT_STRTAB`` *none*
1735 ``.symtab`` ``SHT_SYMTAB`` *none*
1736 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1737 ================== ================ =================================
These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug_``\ *\**
The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1744 information on the DWARF produced by the AMDGPU backend.
1746 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1747 The standard sections used by a dynamic loader.
``.note``
See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
backend.
1753 ``.rela``\ *name*, ``.rela.dyn``
1754 For relocatable code objects, *name* is the name of the section that the
1755 relocation records apply. For example, ``.rela.text`` is the section name for
1756 relocation records associated with the ``.text`` section.
1758 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1759 records from each of the relocatable code object's ``.rela``\ *name* sections.
See :ref:`amdgpu-relocation-records` for the relocation records supported by
the AMDGPU backend.

``.text``
The executable machine code for the kernels and functions they call. Generated
1766 as position independent code. See :ref:`amdgpu-code-conventions` for
1767 information on conventions used in the isa generation.
1769 .. _amdgpu-note-records:
1774 The AMDGPU backend code object contains ELF note records in the ``.note``
1775 section. The set of generated notes and their semantics depend on the code
1776 object version; see :ref:`amdgpu-note-records-v2` and
1777 :ref:`amdgpu-note-records-v3-onwards`.
1779 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1780 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1781 byte aligned. In addition, minimal zero-byte padding must be generated to
1782 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 8 byte
alignment.
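
For example, the number of bytes occupied by a note record, including the
required zero-byte padding, can be computed as follows (a minimal sketch
assuming the standard ``Elf64_Nhdr`` definition from ``<elf.h>``; the helper
names are illustrative):

.. code:: c

  #include <elf.h>
  #include <stddef.h>

  /* Round up to the 4 byte alignment required for the name and desc fields. */
  static size_t align4(size_t n) { return (n + 3) & ~(size_t)3; }

  /* Total size of one note record: the header, the name padded to 4 bytes,
     and the desc padded to 4 bytes. */
  static size_t note_record_size(const Elf64_Nhdr *nhdr) {
    return sizeof(Elf64_Nhdr) + align4(nhdr->n_namesz) + align4(nhdr->n_descsz);
  }
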
1786 .. _amdgpu-note-records-v2:
1788 Code Object V2 Note Records
1789 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1792 Code object V2 generation is no longer supported by this version of LLVM.
1794 The AMDGPU backend code object uses the following ELF note record in the
1795 ``.note`` section when compiling for code object V2.
1797 The note record vendor field is "AMD".
1799 Additional note records may be present, but any which are not documented here
1800 are deprecated and should not be used.
1802 .. table:: AMDGPU Code Object V2 ELF Note Records
1803 :name: amdgpu-elf-note-records-v2-table
1805 ===== ===================================== ======================================
1806 Name Type Description
1807 ===== ===================================== ======================================
1808 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
1809 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
1810 Finalizer and not the LLVM compiler.
1811 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
1812 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
1813 YAML [YAML]_ textual format.
1814 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
1815 ===== ===================================== ======================================
1819 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1820 :name: amdgpu-elf-note-record-enumeration-values-v2-table
1822 ===================================== =====
Name Value
===================================== =====
1825 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
1826 ``NT_AMD_HSA_HSAIL`` 2
1827 ``NT_AMD_HSA_ISA_VERSION`` 3
1829 ``NT_AMD_HSA_METADATA`` 10
1830 ``NT_AMD_HSA_ISA_NAME`` 11
1831 ===================================== =====
1833 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
Specifies the code object version number. The description field has the
following layout:
1839 struct amdgpu_hsa_note_code_object_version_s {
1840 uint32_t major_version;
uint32_t minor_version;
};
1844 The ``major_version`` has a value less than or equal to 2.
1846 ``NT_AMD_HSA_HSAIL``
1847 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1848 field has the following layout:
1852 struct amdgpu_hsa_note_hsail_s {
1853 uint32_t hsail_major_version;
1854 uint32_t hsail_minor_version;
uint8_t profile;
uint8_t machine_model;
uint8_t default_float_round;
};
1860 ``NT_AMD_HSA_ISA_VERSION``
1861 Specifies the target ISA version. The description field has the following layout:
1865 struct amdgpu_hsa_note_isa_s {
1866 uint16_t vendor_name_size;
1867 uint16_t architecture_name_size;
uint32_t major;
uint32_t minor;
uint32_t stepping;
char vendor_and_architecture_name[1];
};
``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
vendor and architecture names respectively, including the NUL character.

``vendor_and_architecture_name`` contains the NUL terminated string for the
vendor, immediately followed by the NUL terminated string for the
architecture.
1881 This note record is used by the HSA runtime loader.
1883 Code object V2 only supports a limited number of processors and has fixed
1884 settings for target features. See
1885 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1886 processors and the corresponding target ID. In the table the note record ISA
1887 name is a concatenation of the vendor name, architecture name, major, minor,
1888 and stepping separated by a ":".
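
A sketch of forming that ISA name from the ``amdgpu_hsa_note_isa_s`` fields is
shown below; it assumes the vendor and architecture strings have already been
extracted from ``vendor_and_architecture_name``, and the helper name is
illustrative only:

.. code:: c

  #include <inttypes.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Build the "vendor:architecture:major:minor:stepping" ISA name, for
     example "AMD:AMDGPU:9:0:6". */
  static void format_isa_name(char *buf, size_t size, const char *vendor,
                              const char *arch, uint32_t major, uint32_t minor,
                              uint32_t stepping) {
    snprintf(buf, size, "%s:%s:%" PRIu32 ":%" PRIu32 ":%" PRIu32, vendor, arch,
             major, minor, stepping);
  }
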
1890 The target ID column shows the processor name and fixed target features used
1891 by the LLVM compiler. The LLVM compiler does not generate a
1892 ``NT_AMD_HSA_HSAIL`` note record.
1894 A code object generated by the Finalizer also uses code object V2 and always
1895 generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
``sramecc`` target feature are as shown in
:ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
bit.
1901 ``NT_AMD_HSA_ISA_NAME``
1902 Specifies the target ISA name as a non-NUL terminated string.
1904 This note record is not used by the HSA runtime loader.
1906 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1907 V2's limited support of processors and fixed settings for target features.
1909 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1910 from the string to the corresponding target ID. If the ``xnack`` target
1911 feature is supported and enabled, the string produced by the LLVM compiler
will have a ``+xnack`` appended. The Finalizer did not do the appending and
1913 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1915 ``NT_AMD_HSA_METADATA``
1916 Specifies extensible metadata associated with the code objects executed on HSA
1917 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1918 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
:ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
metadata string.
1922 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1923 :name: amdgpu-elf-note-record-supported_processors-v2-table
1925 ===================== ==========================
1926 Note Record ISA Name Target ID
1927 ===================== ==========================
1928 ``AMD:AMDGPU:6:0:0`` ``gfx600``
1929 ``AMD:AMDGPU:6:0:1`` ``gfx601``
1930 ``AMD:AMDGPU:6:0:2`` ``gfx602``
1931 ``AMD:AMDGPU:7:0:0`` ``gfx700``
1932 ``AMD:AMDGPU:7:0:1`` ``gfx701``
1933 ``AMD:AMDGPU:7:0:2`` ``gfx702``
1934 ``AMD:AMDGPU:7:0:3`` ``gfx703``
1935 ``AMD:AMDGPU:7:0:4`` ``gfx704``
1936 ``AMD:AMDGPU:7:0:5`` ``gfx705``
1937 ``AMD:AMDGPU:8:0:0`` ``gfx802``
1938 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
1939 ``AMD:AMDGPU:8:0:2`` ``gfx802``
1940 ``AMD:AMDGPU:8:0:3`` ``gfx803``
1941 ``AMD:AMDGPU:8:0:4`` ``gfx803``
1942 ``AMD:AMDGPU:8:0:5`` ``gfx805``
1943 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
1944 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
1945 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
1946 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
1947 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
1948 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
1949 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
1950 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
1951 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
1952 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1953 ===================== ==========================
1955 .. _amdgpu-note-records-v3-onwards:
1957 Code Object V3 and Above Note Records
1958 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1960 The AMDGPU backend code object uses the following ELF note record in the
1961 ``.note`` section when compiling for code object V3 and above.
1963 The note record vendor field is "AMDGPU".
1965 Additional note records may be present, but any which are not documented here
1966 are deprecated and should not be used.
1968 .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1969 :name: amdgpu-elf-note-records-table-v3-onwards
1971 ======== ============================== ======================================
1972 Name Type Description
1973 ======== ============================== ======================================
1974 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
1976 ======== ============================== ======================================
1980 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1981 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1983 ============================== =====
Name Value
============================== =====
1987 ``NT_AMDGPU_METADATA`` 32
1988 ============================== =====
1990 ``NT_AMDGPU_METADATA``
1991 Specifies extensible metadata associated with an AMDGPU code object. It is
1992 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1993 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1994 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
metadata.
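
A consumer scanning the ``.note`` section can identify this note by its vendor
name and type before handing the ``desc`` bytes to a MessagePack decoder. A
minimal sketch, assuming the standard ``Elf64_Nhdr`` definition and a ``name``
pointer to the note's name bytes:

.. code:: c

  #include <elf.h>
  #include <string.h>

  #define NT_AMDGPU_METADATA 32

  /* Return non-zero if the note is the AMDGPU metadata note. The MessagePack
     encoded metadata is the desc field that follows the padded name. */
  static int is_amdgpu_metadata_note(const Elf64_Nhdr *nhdr, const char *name) {
    return nhdr->n_type == NT_AMDGPU_METADATA &&
           nhdr->n_namesz == sizeof("AMDGPU") &&
           memcmp(name, "AMDGPU", sizeof("AMDGPU")) == 0;
  }
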
2003 Symbols include the following:
2005 .. table:: AMDGPU ELF Symbols
2006 :name: amdgpu-elf-symbols-table
2008 ===================== ================== ================ ==================
2009 Name Type Section Description
2010 ===================== ================== ================ ==================
2011 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
2014 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
2015 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
2016 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
2017 ===================== ================== ================ ==================
2020 Global variables both used and defined by the compilation unit.
2022 If the symbol is defined in the compilation unit then it is allocated in the
appropriate section according to whether it has initialized data or is read-only.
2025 If the symbol is external then its section is ``STN_UNDEF`` and the loader
2026 will resolve relocations using the definition provided by another code object
2027 or explicitly defined by the runtime.
2029 If the symbol resides in local/group memory (LDS) then its section is the
2030 special processor specific section name ``SHN_AMDGPU_LDS``, and the
``st_value`` field describes alignment requirements as it does for common
symbols.
2036 Add description of linked shared object symbols. Seems undefined symbols
2037 are marked as STT_NOTYPE.
2040 Every HSA kernel has an associated kernel descriptor. It is the address of the
2041 kernel descriptor that is used in the AQL dispatch packet used to invoke the
2042 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
2043 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
2046 Every HSA kernel also has a symbol for its machine code entry point.
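
For example, when constructing an AQL dispatch packet it is the loaded address
of the *link-name*\ ``.kd`` descriptor symbol, not the entry point symbol, that
is stored in the packet. A minimal sketch, assuming the HSA runtime's ``hsa.h``
packet definition and a descriptor address already resolved by the loader:

.. code:: c

  #include <hsa/hsa.h>
  #include <stdint.h>

  /* kd_address is the loaded address of the <kernel>.kd symbol; the machine
     code entry point address is not placed in the packet. */
  static void set_kernel_object(hsa_kernel_dispatch_packet_t *packet,
                                uint64_t kd_address) {
    packet->kernel_object = kd_address;
  }
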
2048 .. _amdgpu-relocation-records:
2053 The AMDGPU backend generates ``Elf64_Rela`` relocation records for
2054 AMDHSA or ``Elf64_Rel`` relocation records for Mesa/AMDPAL. Supported
2055 relocatable fields are:
``word32``
This specifies a 32-bit field occupying 4 bytes with arbitrary byte
2059 alignment. These values use the same byte order as other word values in the
2060 AMDGPU architecture.
``word64``
This specifies a 64-bit field occupying 8 bytes with arbitrary byte
2064 alignment. These values use the same byte order as other word values in the
2065 AMDGPU architecture.
The following notation is used for specifying relocation calculations:
``A``
Represents the addend used to compute the value of the relocatable field. If
2071 the addend field is smaller than 64 bits then it is zero-extended to 64 bits
2072 for use in the calculations below. (In practice this only affects ``_HI``
2073 relocation types on Mesa/AMDPAL, where the addend comes from the 32-bit field
but the result of the calculation depends on the high part of the full 64-bit
value.)

``G``
Represents the offset into the global offset table at which the relocation
2079 entry's symbol will reside during execution.
``GOT``
Represents the address of the global offset table.
``P``
Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
2086 of the storage unit being relocated (computed using ``r_offset``).
``S``
Represents the value of the symbol whose index resides in the relocation
entry. Relocations not using this must specify a symbol index of
``STN_UNDEF``.
``B``
Represents the base address of a loaded executable or shared object which is
2095 the difference between the ELF address and the actual load address.
2096 Relocations using this are only valid in executable or shared objects.
2098 The following relocation types are supported:
2100 .. table:: AMDGPU ELF Relocation Records
2101 :name: amdgpu-elf-relocation-records-table
2103 ========================== ======= ===== ========== ==============================
2104 Relocation Type Kind Value Field Calculation
2105 ========================== ======= ===== ========== ==============================
2106 ``R_AMDGPU_NONE`` 0 *none* *none*
2107 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
2109 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
2111 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
2113 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
2114 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
2115 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
2117 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
2118 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
2119 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
2120 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
2121 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
2123 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
2124 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
2125 ========================== ======= ===== ========== ==============================
2127 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
2128 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
2130 There is no current OS loader support for 32-bit programs and so
2131 ``R_AMDGPU_ABS32`` is not used.
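
As an illustration, a loader applying the split 32-bit absolute relocations
from the table computes ``S + A`` once and stores the low and high halves into
the two ``word32`` fields. The sketch below uses ``memcpy`` because the
relocated fields may have arbitrary byte alignment, and it assumes a
little-endian host matching the AMDGPU byte order; the helper name is
illustrative only:

.. code:: c

  #include <stdint.h>
  #include <string.h>

  /* Apply R_AMDGPU_ABS32_LO and R_AMDGPU_ABS32_HI given the symbol value S
     and addend A. place_lo and place_hi address the relocated word32 fields. */
  static void apply_abs32_pair(uint8_t *place_lo, uint8_t *place_hi,
                               uint64_t s, uint64_t a) {
    uint32_t lo = (uint32_t)((s + a) & 0xFFFFFFFFu);
    uint32_t hi = (uint32_t)((s + a) >> 32);
    memcpy(place_lo, &lo, sizeof(lo));
    memcpy(place_hi, &hi, sizeof(hi));
  }
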
2133 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
2135 Loaded Code Object Path Uniform Resource Identifier (URI)
2136 ---------------------------------------------------------
2138 The AMD GPU code object loader represents the path of the ELF shared object from
2139 which the code object was loaded as a textual Uniform Resource Identifier (URI).
2140 Note that the code object is the in memory loaded relocated form of the ELF
2141 shared object. Multiple code objects may be loaded at different memory
2142 addresses in the same process from the same ELF shared object.
2144 The loaded code object path URI syntax is defined by the following BNF syntax:
2148 code_object_uri ::== file_uri | memory_uri
2149 file_uri ::== "file://" file_path [ range_specifier ]
2150 memory_uri ::== "memory://" process_id range_specifier
2151 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
2152 file_path ::== URI_ENCODED_OS_FILE_PATH
2153 process_id ::== DECIMAL_NUMBER
2154 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
2157 Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
2158 and octal values by "0".
2161 Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
2162 every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
encoded as two uppercase hexadecimal digits preceded by "%". Directories in
2164 the path are separated by "/".
2167 Is a 0-based byte offset to the start of the code object. For a file URI, it
2168 is from the start of the file specified by the ``file_path``, and if omitted
2169 defaults to 0. For a memory URI, it is the memory address and is required.
2172 Is the number of bytes in the code object. For a file URI, if omitted it
2173 defaults to the size of the file. It is required for a memory URI.
2176 Is the identity of the process owning the memory. For Linux it is the C
2177 unsigned integral decimal literal for the process ID (PID).
2183 file:///dir1/dir2/file1
2184 file:///dir3/dir4/file2#offset=0x2000&size=3000
2185 memory://1234#offset=0x20000&size=3000
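
For instance, the memory URI for a code object loaded at a given address could
be formed as follows (a sketch; per the grammar above the numbers may equally
be written in decimal or octal, and the helper name is illustrative only):

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Format a URI such as "memory://1234#offset=0x20000&size=3000". */
  static void format_memory_uri(char *buf, size_t size, long pid,
                                uint64_t load_address, uint64_t byte_size) {
    snprintf(buf, size, "memory://%ld#offset=0x%llx&size=%llu", pid,
             (unsigned long long)load_address, (unsigned long long)byte_size);
  }
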
2187 .. _amdgpu-dwarf-debug-information:
2189 DWARF Debug Information
2190 =======================
2194 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
2195 is not currently fully implemented and is subject to change.
2197 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
2198 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
2199 object executable code and data to the source language constructs. It can be
2200 used by tools such as debuggers and profilers. It uses features defined in
2201 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
2202 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
2204 This section defines the AMDGPU target architecture specific DWARF mappings.
2206 .. _amdgpu-dwarf-register-identifier:
2211 This section defines the AMDGPU target architecture register numbers used in
2212 DWARF operation expressions (see DWARF Version 5 section 2.5 and
2213 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
2214 instructions (see DWARF Version 5 section 6.4 and
2215 :ref:`amdgpu-dwarf-call-frame-information`).
2217 A single code object can contain code for kernels that have different wavefront
2218 sizes. The vector registers and some scalar registers are based on the wavefront
2219 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
2220 simplifies the consumer of the DWARF so that each register has a fixed size,
2221 rather than being dynamic according to the wavefront size mode. Similarly,
2222 distinct DWARF registers are defined for those registers that vary in size
2223 according to the process address size. This allows a consumer to treat a
2224 specific AMDGPU processor as a single architecture regardless of how it is
2225 configured at run time. The compiler explicitly specifies the DWARF registers
2226 that match the mode in which the code it is generating will be executed.
2228 DWARF registers are encoded as numbers, which are mapped to architecture
2229 registers. The mapping for AMDGPU is defined in
2230 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
2233 .. table:: AMDGPU DWARF Register Mapping
2234 :name: amdgpu-dwarf-register-mapping-table
2236 ============== ================= ======== ==================================
2237 DWARF Register AMDGPU Register Bit Size Description
2238 ============== ================= ======== ==================================
2239 0 PC_32 32 Program Counter (PC) when
2240 executing in a 32-bit process
2241 address space. Used in the CFI to
2242 describe the PC of the calling
2244 1 EXEC_MASK_32 32 Execution Mask Register when
2245 executing in wavefront 32 mode.
2246 2-15 *Reserved* *Reserved for highly accessed
2247 registers using DWARF shortcut.*
2248 16 PC_64 64 Program Counter (PC) when
2249 executing in a 64-bit process
2250 address space. Used in the CFI to
2251 describe the PC of the calling
2253 17 EXEC_MASK_64 64 Execution Mask Register when
2254 executing in wavefront 64 mode.
2255 18-31 *Reserved* *Reserved for highly accessed
2256 registers using DWARF shortcut.*
2257 32-95 SGPR0-SGPR63 32 Scalar General Purpose
2259 96-127 *Reserved* *Reserved for frequently accessed
2260 registers using DWARF 1-byte ULEB.*
2261 128 STATUS 32 Status Register.
2262 129-511 *Reserved* *Reserved for future Scalar
2263 Architectural Registers.*
2264 512 VCC_32 32 Vector Condition Code Register
2265 when executing in wavefront 32
2267 513-767 *Reserved* *Reserved for future Vector
2268 Architectural Registers when
2269 executing in wavefront 32 mode.*
2270 768 VCC_64 64 Vector Condition Code Register
2271 when executing in wavefront 64
2273 769-1023 *Reserved* *Reserved for future Vector
2274 Architectural Registers when
2275 executing in wavefront 64 mode.*
2276 1024-1087 *Reserved* *Reserved for padding.*
2277 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers.
2278 1130-1535 *Reserved* *Reserved for future Scalar
2279 General Purpose Registers.*
2280 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers
2281 when executing in wavefront 32
2283 1792-2047 *Reserved* *Reserved for future Vector
2284 General Purpose Registers when
2285 executing in wavefront 32 mode.*
2286 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers
2287 when executing in wavefront 32
2289 2304-2559 *Reserved* *Reserved for future Vector
2290 Accumulation Registers when
2291 executing in wavefront 32 mode.*
2292 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers
2293 when executing in wavefront 64
2295 2816-3071 *Reserved* *Reserved for future Vector
2296 General Purpose Registers when
2297 executing in wavefront 64 mode.*
2298 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers
2299 when executing in wavefront 64
2301 3328-3583 *Reserved* *Reserved for future Vector
2302 Accumulation Registers when
2303 executing in wavefront 64 mode.*
2304 ============== ================= ======== ==================================
2306 The vector registers are represented as the full size for the wavefront. They
2307 are organized as consecutive dwords (32-bits), one per lane, with the dword at
2308 the least significant bit position corresponding to lane 0 and so forth. DWARF
2309 location expressions involving the ``DW_OP_LLVM_offset`` and
2310 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
2311 register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
2315 If the wavefront size is 32 lanes then the wavefront 32 mode register
2316 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
2317 mode register definitions are used. Some AMDGPU targets support executing in
2318 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
2319 to the wavefront mode of the generated code will be used.
2321 If code is generated to execute in a 32-bit process address space, then the
2322 32-bit process address space register definitions are used. If code is generated
2323 to execute in a 64-bit process address space, then the 64-bit process address
2324 space register definitions are used. The ``amdgcn`` target only supports the
2325 64-bit process address space.
2327 .. _amdgpu-dwarf-memory-space-identifier:
2329 Memory Space Identifier
2330 -----------------------
2332 The DWARF memory space represents the source language memory space. See DWARF
2333 Version 5 section 2.12 which is updated by the *DWARF Extensions For
2334 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
2336 The DWARF memory space mapping used for AMDGPU is defined in
2337 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
2339 .. table:: AMDGPU DWARF Memory Space Mapping
2340 :name: amdgpu-dwarf-memory-space-mapping-table
2342 =========================== ====== =================
DWARF AMDGPU
---------------------------------- -----------------
2345 Memory Space Name Value Memory Space
2346 =========================== ====== =================
2347 ``DW_MSPACE_LLVM_none`` 0x0000 Generic (Flat)
2348 ``DW_MSPACE_LLVM_global`` 0x0001 Global
2349 ``DW_MSPACE_LLVM_constant`` 0x0002 Global
2350 ``DW_MSPACE_LLVM_group`` 0x0003 Local (group/LDS)
2351 ``DW_MSPACE_LLVM_private`` 0x0004 Private (Scratch)
2352 ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
2353 =========================== ====== =================
2355 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
2356 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. This is
2359 available for use for the AMD extension for access to the hardware GDS memory
2360 which is scratchpad memory allocated per device.
2362 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
2363 default memory space of ``DW_MSPACE_LLVM_none`` is used.
2365 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF memory spaces to DWARF address spaces, including address size
and NULL value.
2369 .. _amdgpu-dwarf-address-space-identifier:
2371 Address Space Identifier
2372 ------------------------
2374 DWARF address spaces correspond to target architecture specific linear
2375 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
2376 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
2378 The DWARF address space mapping used for AMDGPU is defined in
2379 :ref:`amdgpu-dwarf-address-space-mapping-table`.
2381 .. table:: AMDGPU DWARF Address Space Mapping
2382 :name: amdgpu-dwarf-address-space-mapping-table
2384 ======================================= ===== ======= ======== ===================== =======================
2386 --------------------------------------- ----- ---------------- --------------------- -----------------------
2387 Address Space Name Value Address Bit Size LLVM IR Address Space
2388 --------------------------------------- ----- ------- -------- --------------------- -----------------------
2393 ======================================= ===== ======= ======== ===================== =======================
2394 ``DW_ASPACE_LLVM_none`` 0x00 64 32 Global *default address space*
2395 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
2396 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
2397 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
2399 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
2400 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
2401 ======================================= ===== ======= ======== ===================== =======================
2403 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
2404 spaces including address size and NULL value.
2406 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
2407 address space used in DWARF operations that do not specify an address space. It
2408 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
2409 related operations can refer to addresses in the program code.
2411 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
2412 specify the flat address space. If the address corresponds to an address in the
2413 local address space, then it corresponds to the wavefront that is executing the
2414 focused thread of execution. If the address corresponds to an address in the
2415 private address space, then it corresponds to the lane that is executing the
2416 focused thread of execution for languages that are implemented using a SIMD or
2417 SIMT execution model.
2421 CUDA-like languages such as HIP that do not have address spaces in the
2422 language type system, but do allow variables to be allocated in different
2423 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2424 address space in the DWARF expression operations as the default address space
2425 is the global address space.
2427 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2428 specify the local address space corresponding to the wavefront that is executing
2429 the focused thread of execution.
2431 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2432 to specify the private address space corresponding to the lane that is executing
2433 the focused thread of execution for languages that are implemented using a SIMD
2434 or SIMT execution model.
2436 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2437 to specify the unswizzled private address space corresponding to the wavefront
2438 that is executing the focused thread of execution. The wavefront view of private
2439 memory is the per wavefront unswizzled backing memory layout defined in
2440 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2441 location for the backing memory of the wavefront (namely the address is not
2442 offset by ``wavefront-scratch-base``). The following formula can be used to
2443 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2444 ``DW_ASPACE_AMDGPU_private_wave`` address:
2448 private-address-wavefront =
2449 ((private-address-lane / 4) * wavefront-size * 4) +
2450 (wavefront-lane-id * 4) + (private-address-lane % 4)
2452 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
can be simplified to:
2458 private-address-wavefront =
2459 private-address-lane * wavefront-size
2461 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2462 complete spilled vector register back into a complete vector register in the
2463 CFI. The frame pointer can be a private lane address which is dword aligned,
2464 which can be shifted to multiply by the wavefront size, and then used to form a
2465 private wavefront address that gives a location for a contiguous set of dwords,
2466 one per lane, where the vector register dwords are spilled. The compiler knows
2467 the wavefront size since it generates the code. Note that the type of the
2468 address may have to be converted as the size of a
2469 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2470 ``DW_ASPACE_AMDGPU_private_wave`` address.
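
The conversion can also be written directly in code. A minimal sketch of the
dword-based formula above (the function name is illustrative only):

.. code:: c

  #include <stdint.h>

  /* Convert a swizzled per-lane private (scratch) address to the unswizzled
     per-wavefront address, following the formula above. */
  static uint32_t private_lane_to_wave_address(uint32_t lane_address,
                                               uint32_t wavefront_size,
                                               uint32_t lane_id) {
    return ((lane_address / 4) * wavefront_size * 4) + (lane_id * 4) +
           (lane_address % 4);
  }
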
2472 .. _amdgpu-dwarf-lane-identifier:
DWARF lane identifiers specify a target architecture lane position for hardware
2478 that executes in a SIMD or SIMT manner, and on which a source language maps its
2479 threads of execution onto those lanes. The DWARF lane identifier is pushed by
2480 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2481 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2482 section :ref:`amdgpu-dwarf-operation-expressions`.
2484 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2485 wavefront. It is numbered from 0 to the wavefront size minus 1.
2487 Operation Expressions
2488 ---------------------
2490 DWARF expressions are used to compute program values and the locations of
2491 program objects. See DWARF Version 5 section 2.5 and
2492 :ref:`amdgpu-dwarf-operation-expressions`.
2494 DWARF location descriptions describe how to access storage which includes memory
2495 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2496 significant bytes first, and bits are ordered within bytes with least
2497 significant bits first.
2499 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2500 unwinding vector registers that are spilled under the execution mask to memory:
2501 the zero-single location description is the vector register, and the one-single
2502 location description is the spilled memory location description. The
2503 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2504 memory location description.
2506 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2507 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2508 controlled by the execution mask. An undefined location description together
2509 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2510 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2512 Debugger Information Entry Attributes
2513 -------------------------------------
2515 This section describes how certain debugger information entry attributes are
2516 used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
2517 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2518 :ref:`amdgpu-dwarf-low-level-information` and
2519 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2521 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2523 ``DW_AT_LLVM_lane_pc``
2524 ~~~~~~~~~~~~~~~~~~~~~~
2526 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2527 location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.
2532 If the lane is inactive, but was active on entry to the subprogram, then this is
2533 the program location in the subprogram at which execution of the lane is
conceptually positioned.
2536 If the lane was not active on entry to the subprogram, then this will be the
2537 undefined location. A client debugger can check if the lane is part of a valid
2538 work-group by checking that the lane is in the range of the associated
2539 work-group within the grid, accounting for partial work-groups. If it is not,
2540 then the debugger can omit any information for the lane. Otherwise, the debugger
2541 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2542 calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
2544 ``DW_AT_LLVM_lane_pc``.
2546 The following example illustrates how the AMDGPU backend can generate a DWARF
2547 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2548 following subprogram pseudo code for a target with 64 lanes per wavefront.
2570 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2571 execution mask (``EXEC``) to linearize the control flow. The condition is
2572 evaluated to make a mask of the lanes for which the condition evaluates to true.
2573 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2574 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
2575 ``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of
2576 the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE``
2577 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2578 region. This is shown below. Other approaches are possible, but the basic
2579 concept is the same.
2612 To create the DWARF location list expression that defines the location
2613 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2614 pseudo instruction can be used to annotate the linearized control flow. This can
2615 be done by defining an artificial variable for the lane PC. The DWARF location
2616 list expression created for it is used as the value of the
2617 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2619 A DWARF procedure is defined for each well nested structured control flow region
2620 which provides the conceptual lane program location for a lane if it is not
2621 active (namely it is divergent). The DWARF operation expression for each region
2622 conceptually inherits the value of the immediately enclosing region and modifies
2623 it according to the semantics of the region.
2625 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2626 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2627 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2628 region since the ``THEN`` region has completed.
2630 The lane PC artificial variable is assigned at each region transition. It uses
2631 the immediately enclosing region's DWARF procedure to compute the program
2632 location for each lane assuming they are divergent, and then modifies the result
2633 by inserting the current program location for each lane that the ``EXEC`` mask
2634 indicates is active.
2636 By having separate DWARF procedures for each region, they can be reused to
2637 define the value for any nested region. This reduces the total size of the DWARF
2638 operation expressions.
2640 The following provides an example using pseudo LLVM MIR.
2646 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2647 DW_AT_name = "__uint64";
2648 DW_AT_byte_size = 8;
2649 DW_AT_encoding = DW_ATE_unsigned;
2651 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2652 DW_AT_name = "__active_lane_pc";
2655 DW_OP_LLVM_extend 64, 64;
2656 DW_OP_regval_type EXEC, %uint_64;
2657 DW_OP_LLVM_select_bit_piece 64, 64;
2660 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2661 DW_AT_name = "__divergent_lane_pc";
2663 DW_OP_LLVM_undefined;
2664 DW_OP_LLVM_extend 64, 64;
2667 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2668 DW_OP_call_ref %__divergent_lane_pc;
2669 DW_OP_call_ref %__active_lane_pc;
2673 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2678 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2679 DW_AT_name = "__divergent_lane_pc_1_then";
2680 DW_AT_location = DIExpression[
2681 DW_OP_call_ref %__divergent_lane_pc;
2682 DW_OP_addrx &lex_1_start;
2684 DW_OP_LLVM_extend 64, 64;
2685 DW_OP_call_ref %__lex_1_save_exec;
2686 DW_OP_deref_type 64, %__uint_64;
2687 DW_OP_LLVM_select_bit_piece 64, 64;
2690 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2691 DW_OP_call_ref %__divergent_lane_pc_1_then;
2692 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2701 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2702 DW_AT_name = "__divergent_lane_pc_1_1_then";
2703 DW_AT_location = DIExpression[
2704 DW_OP_call_ref %__divergent_lane_pc_1_then;
2705 DW_OP_addrx &lex_1_1_start;
2707 DW_OP_LLVM_extend 64, 64;
2708 DW_OP_call_ref %__lex_1_1_save_exec;
2709 DW_OP_deref_type 64, %__uint_64;
2710 DW_OP_LLVM_select_bit_piece 64, 64;
2713 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2714 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2715 DW_OP_call_ref %__active_lane_pc;
2720 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2721 DW_AT_name = "__divergent_lane_pc_1_1_else";
2722 DW_AT_location = DIExpression[
2723 DW_OP_call_ref %__divergent_lane_pc_1_then;
2724 DW_OP_addrx &lex_1_1_end;
2726 DW_OP_LLVM_extend 64, 64;
2727 DW_OP_call_ref %__lex_1_1_save_exec;
2728 DW_OP_deref_type 64, %__uint_64;
2729 DW_OP_LLVM_select_bit_piece 64, 64;
2732 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2733 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2734 DW_OP_call_ref %__active_lane_pc;
2739 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2740 DW_OP_call_ref %__divergent_lane_pc;
2741 DW_OP_call_ref %__active_lane_pc;
2746 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2747 DW_AT_name = "__divergent_lane_pc_1_else";
2748 DW_AT_location = DIExpression[
2749 DW_OP_call_ref %__divergent_lane_pc;
2750 DW_OP_addrx &lex_1_end;
2752 DW_OP_LLVM_extend 64, 64;
2753 DW_OP_call_ref %__lex_1_save_exec;
2754 DW_OP_deref_type 64, %__uint_64;
2755 DW_OP_LLVM_select_bit_piece 64, 64;
2758 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2759 DW_OP_call_ref %__divergent_lane_pc_1_else;
2760 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2766 DW_OP_call_ref %__divergent_lane_pc;
2767 DW_OP_call_ref %__active_lane_pc;
2772 The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
2773 that are active, with the current program location.
2775 Artificial variables %__lex_1_save_exec and %__lex_1_1_save_exec are created for
2776 the execution masks saved on entry to a region. Using the ``DBG_VALUE`` pseudo
2777 instruction, location list entries will be created that describe where the
2778 artificial variables are allocated at any given program location. The compiler
2779 may allocate them to registers or spill them to memory.
2781 The DWARF procedures for each region use the values of the saved execution mask
2782 artificial variables to only update the lanes that are active on entry to the
2783 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will have
2785 the undefined location description.
2787 Other structured control flow regions can be handled similarly. For example,
2788 loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
2792 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2793 ``IF/THEN/ELSE`` regions.
2795 The DWARF procedures can use the active lane artificial variable described in
2796 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2797 ``EXEC`` mask in order to support whole or quad wavefront mode.
2799 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2801 ``DW_AT_LLVM_active_lane``
2802 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2804 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.
2808 The execution mask may be modified to implement whole or quad wavefront mode
2809 operations. For example, all lanes may need to temporarily be made active to
2810 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2811 update it to enable the necessary lanes, perform the operations, and then
2812 restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
``EXEC`` value.
2816 This is handled by defining an artificial variable for the active lane mask. The
2817 active lane mask artificial variable would be the actual ``EXEC`` mask for
2818 normal regions, and the saved execution mask for regions where the mask is
2819 temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.
2823 ``DW_AT_LLVM_augmentation``
2824 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2826 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2827 debugger information entry has the following value for the augmentation string:
2833 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2834 extensions used in the DWARF of the compilation unit. The version number
2835 conforms to [SEMVER]_.
2837 Call Frame Information
2838 ----------------------
2840 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2841 *unwind* call frames in a running process or core dump. See DWARF Version 5
2842 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2844 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2846 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
2852 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
2853 extensions used in this CIE or to the FDEs that use it. The version number
2854 conforms to [SEMVER]_.
2856 2. ``address_size`` for the ``Global`` address space is defined in
2857 :ref:`amdgpu-dwarf-address-space-identifier`.
2859 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2861 4. ``code_alignment_factor`` is 4 bytes.
2865 Add to :ref:`amdgpu-processor-table` table.
2867 5. ``data_alignment_factor`` is 4 bytes.
2871 Add to :ref:`amdgpu-processor-table` table.
2873 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2874 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2876 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
2877 called from subprogram Y that has more allocated, X will not change any of
2878 the extra registers as it cannot access them. Therefore, the default rule
2879 for all columns is ``same value``.
2881 For AMDGPU the register number follows the numbering defined in
2882 :ref:`amdgpu-dwarf-register-identifier`.
2884 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2885 the return address to get the address of a byte within the call site
2886 instructions. See DWARF Version 5 section 6.4.4.
2891 See DWARF Version 5 section 6.1.
2893 Lookup By Name Section Header
2894 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2896 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2898 For AMDGPU the lookup by name section header table:
2900 ``augmentation_string_size`` (uword)
Set to the length of the ``augmentation_string`` value which is always a
multiple of 4.
2905 ``augmentation_string`` (sequence of UTF-8 characters)
2907 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2913 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2914 extensions used in the DWARF of this index. The version number conforms to
2919 This is different to the DWARF Version 5 definition that requires the first
2920 4 characters to be the vendor ID. But this is consistent with the other
2921 augmentation strings and does allow multiple vendor contributions. However,
2922 backwards compatibility may be more desirable.
2924 Lookup By Address Section Header
2925 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2927 See DWARF Version 5 section 6.1.2.
2929 For AMDGPU the lookup by address section header table:
2931 ``address_size`` (ubyte)
2933 Match the address size for the ``Global`` address space defined in
2934 :ref:`amdgpu-dwarf-address-space-identifier`.
2936 ``segment_selector_size`` (ubyte)
2938 AMDGPU does not use a segment selector so this is 0. The entries in the
2939 ``.debug_aranges`` do not have a segment selector.
2941 Line Number Information
2942 -----------------------
2944 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2947 The instruction set must be obtained from the ELF file header ``e_flags`` field
2948 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2949 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
2953 Should the ``isa`` state machine register be used to indicate if the code is
2954 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2956 For AMDGPU the line number program header fields have the following values (see
2957 DWARF Version 5 section 6.2.4):
2959 ``address_size`` (ubyte)
2960 Matches the address size for the ``Global`` address space defined in
2961 :ref:`amdgpu-dwarf-address-space-identifier`.
2963 ``segment_selector_size`` (ubyte)
2964 AMDGPU does not use a segment selector so this is 0.
2966 ``minimum_instruction_length`` (ubyte)
2967 For GFX9-GFX11 this is 4.
2969 ``maximum_operations_per_instruction`` (ubyte)
2970 For GFX9-GFX11 this is 1.
2972 Source text for online-compiled programs (for example, those compiled by the
2973 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2974 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2975 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2976 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2978 The Clang option used to control source embedding in AMDGPU is defined in
2979 :ref:`amdgpu-clang-debug-options-table`.
2981 .. table:: AMDGPU Clang Debug Options
2982 :name: amdgpu-clang-debug-options-table
2984 ==================== ==================================================
2985 Debug Flag Description
2986 ==================== ==================================================
2987 -g[no-]embed-source Enable/disable embedding source text in DWARF
2988 debug sections. Useful for environments where
2989 source cannot be written to disk, such as
2990 when performing online compilation.
2991 ==================== ==================================================
``-gembed-source``
Enable the embedded source.
2998 ``-gno-embed-source``
2999 Disable the embedded source.
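For example, an online compiler could request embedded source with an
invocation along the following lines (the input file name is illustrative;
``-gembed-source`` requires DWARF Version 5)::

  clang -target amdgcn-amd-amdhsa -mcpu=gfx906 -gdwarf-5 -gembed-source -c kernel.c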
3001 32-Bit and 64-Bit DWARF Formats
3002 -------------------------------
3004 See DWARF Version 5 section 7.4 and
3005 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
For AMDGPU:

* For the ``amdgcn`` target architecture only the 64-bit process address space is supported.
3012 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
3013 the 32-bit DWARF format.
Unit Headers
------------

For AMDGPU the following values apply for each of the unit headers described in
3019 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
3021 ``address_size`` (ubyte)
3022 Matches the address size for the ``Global`` address space defined in
3023 :ref:`amdgpu-dwarf-address-space-identifier`.
3025 .. _amdgpu-code-conventions:
Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).
3039 .. _amdgpu-amdhsa-code-object-metadata:
3041 Code Object Metadata
3042 ~~~~~~~~~~~~~~~~~~~~
3044 The code object metadata specifies extensible metadata associated with the code
3045 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
3046 encoding and semantics of this metadata depends on the code object version; see
3047 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
3048 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
3049 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
3050 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
3052 Code object metadata is specified in a note record (see
3053 :ref:`amdgpu-note-records`) and is required when the target triple OS is
3054 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
3055 information necessary to support the HSA compatible runtime kernel queries. For
3056 example, the segment sizes needed in a dispatch packet. In addition, a
3057 high-level language runtime may require other information to be included. For
3058 example, the AMD OpenCL runtime records kernel argument information.
3060 .. _amdgpu-amdhsa-code-object-metadata-v2:
3062 Code Object V2 Metadata
3063 +++++++++++++++++++++++
.. warning::

  Code object V2 generation is no longer supported by this version of LLVM.
3068 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
3069 (see :ref:`amdgpu-note-records-v2`).
The metadata is specified as a YAML formatted string (see [YAML]_).
.. note::

   Is the string null terminated? It probably should not be if YAML allows it to
   contain null characters; otherwise it should be.
The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.
3083 For boolean values, the string values of ``false`` and ``true`` are used for
3084 false and true respectively.
3086 Additional information can be added to the mappings. To avoid conflicts, any
3087 non-AMD key names should be prefixed by "*vendor-name*.".
3089 .. table:: AMDHSA Code Object V2 Metadata Map
3090 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
3092 ========== ============== ========= =======================================
3093 String Key Value Type Required? Description
3094 ========== ============== ========= =======================================
3095 "Version" sequence of Required - The first integer is the major
3096 2 integers version. Currently 1.
3097 - The second integer is the minor
3098 version. Currently 0.
3099 "Printf" sequence of Each string is encoded information
3100 strings about a printf function call. The
3101 encoded information is organized as
3102 fields separated by colon (':'):
3104 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3109 A 32-bit integer as a unique id for
3110 each printf function call
3113 A 32-bit integer equal to the number
3114 of arguments of printf function call
3117 ``S[i]`` (where i = 0, 1, ... , N-1)
3118 32-bit integers for the size in bytes
3119 of the i-th FormatString argument of
3120 the printf function call
3123 The format string passed to the
3124 printf function call.
3125 "Kernels" sequence of Required Sequence of the mappings for each
3126 mapping kernel in the code object. See
3127 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
3128 for the definition of the mapping.
3129 ========== ============== ========= =======================================
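As an illustration of the "Printf" entry encoding described in the table above,
the following minimal C++ sketch (not part of any LLVM or runtime interface;
the entry text is made up) splits one encoded entry into its fields::

  #include <cstdint>
  #include <iostream>
  #include <sstream>
  #include <string>
  #include <vector>

  // Split one encoded entry "ID:N:S[0]:...:S[N-1]:FormatString" into fields.
  int main() {
    std::string entry = "1:2:4:8:value %d %ld";  // hypothetical encoded entry
    std::istringstream in(entry);
    std::string field;

    std::getline(in, field, ':');
    uint32_t id = std::stoul(field);        // unique id of the printf call
    std::getline(in, field, ':');
    uint32_t num_args = std::stoul(field);  // number of printf arguments

    std::vector<uint32_t> arg_sizes;        // size in bytes of each argument
    for (uint32_t i = 0; i < num_args; ++i) {
      std::getline(in, field, ':');
      arg_sizes.push_back(std::stoul(field));
    }

    std::string format;                     // the remainder is the format string
    std::getline(in, format);
    std::cout << "id=" << id << " args=" << num_args << " format=" << format
              << "\n";
    return 0;
  }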
3133 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
3134 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
3136 ================= ============== ========= ================================
3137 String Key Value Type Required? Description
3138 ================= ============== ========= ================================
3139 "Name" string Required Source name of the kernel.
3140 "SymbolName" string Required Name of the kernel
3141 descriptor ELF symbol.
3142 "Language" string Source language of the kernel.
3150 "LanguageVersion" sequence of - The first integer is the major
3152 - The second integer is the
3154 "Attrs" mapping Mapping of kernel attributes.
3156 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
3157 for the mapping definition.
3158 "Args" sequence of Sequence of mappings of the
3159 mapping kernel arguments. See
3160 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
3161 for the definition of the mapping.
3162 "CodeProps" mapping Mapping of properties related to
3163 the kernel code. See
3164 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
3165 for the mapping definition.
3166 ================= ============== ========= ================================
3170 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
3171 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
3173 =================== ============== ========= ==============================
3174 String Key Value Type Required? Description
3175 =================== ============== ========= ==============================
3176 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
3177 3 integers must be >=1 and the dispatch
3178 work-group size X, Y, Z must
3179 correspond to the specified
3180 values. Defaults to 0, 0, 0.
3182 Corresponds to the OpenCL
3183 ``reqd_work_group_size``
3185 "WorkGroupSizeHint" sequence of The dispatch work-group size
3186 3 integers X, Y, Z is likely to be the
3189 Corresponds to the OpenCL
3190 ``work_group_size_hint``
3192 "VecTypeHint" string The name of a scalar or vector
3195 Corresponds to the OpenCL
3196 ``vec_type_hint`` attribute.
3198 "RuntimeHandle" string The external symbol name
3199 associated with a kernel.
3200 OpenCL runtime allocates a
3201 global buffer for the symbol
3202 and saves the kernel's address
3203 to it, which is used for
3204 device side enqueueing. Only
3205 available for device side
3207 =================== ============== ========= ==============================
3211 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
3212 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
3214 ================= ============== ========= ================================
3215 String Key Value Type Required? Description
3216 ================= ============== ========= ================================
3217 "Name" string Kernel argument name.
3218 "TypeName" string Kernel argument type name.
3219 "Size" integer Required Kernel argument size in bytes.
3220 "Align" integer Required Kernel argument alignment in
3221 bytes. Must be a power of two.
3222 "ValueKind" string Required Kernel argument kind that
3223 specifies how to set up the
3224 corresponding argument.
3228 The argument is copied
3229 directly into the kernarg.
3232 A global address space pointer
3233 to the buffer data is passed
3236 "DynamicSharedPointer"
3237 A group address space pointer
3238 to dynamically allocated LDS
3239 is passed in the kernarg.
3242 A global address space
3243 pointer to a S# is passed in
3247 A global address space
3248 pointer to a T# is passed in
3252 A global address space pointer
3253 to an OpenCL pipe is passed in
3257 A global address space pointer
3258 to an OpenCL device enqueue
3259 queue is passed in the
3262 "HiddenGlobalOffsetX"
3263 The OpenCL grid dispatch
3264 global offset for the X
3265 dimension is passed in the
3268 "HiddenGlobalOffsetY"
3269 The OpenCL grid dispatch
3270 global offset for the Y
3271 dimension is passed in the
3274 "HiddenGlobalOffsetZ"
3275 The OpenCL grid dispatch
3276 global offset for the Z
3277 dimension is passed in the
3281 An argument that is not used
3282 by the kernel. Space needs to
3283 be left for it, but it does
3284 not need to be set up.
3286 "HiddenPrintfBuffer"
3287 A global address space pointer
3288 to the runtime printf buffer
is passed in kernarg. Mutually
exclusive with
"HiddenHostcallBuffer".
3293 "HiddenHostcallBuffer"
3294 A global address space pointer
3295 to the runtime hostcall buffer
is passed in kernarg. Mutually
exclusive with
"HiddenPrintfBuffer".
3300 "HiddenDefaultQueue"
3301 A global address space pointer
3302 to the OpenCL device enqueue
3303 queue that should be used by
3304 the kernel by default is
3305 passed in the kernarg.
3307 "HiddenCompletionAction"
3308 A global address space pointer
3309 to help link enqueued kernels into
3310 the ancestor tree for determining
3311 when the parent kernel has finished.
3313 "HiddenMultiGridSyncArg"
3314 A global address space pointer for
3315 multi-grid synchronization is
3316 passed in the kernarg.
3318 "ValueType" string Unused and deprecated. This should no longer
3319 be emitted, but is accepted for compatibility.
3322 "PointeeAlign" integer Alignment in bytes of pointee
3323 type for pointer type kernel
3324 argument. Must be a power
3325 of 2. Only present if
3327 "DynamicSharedPointer".
3328 "AddrSpaceQual" string Kernel argument address space
3329 qualifier. Only present if
3330 "ValueKind" is "GlobalBuffer" or
3331 "DynamicSharedPointer". Values
3343 Is GlobalBuffer only Global
3345 DynamicSharedPointer always
3346 Local? Can HCC allow Generic?
3347 How can Private or Region
3350 "AccQual" string Kernel argument access
3351 qualifier. Only present if
3352 "ValueKind" is "Image" or
3365 "ActualAccQual" string The actual memory accesses
3366 performed by the kernel on the
3367 kernel argument. Only present if
3368 "ValueKind" is "GlobalBuffer",
3369 "Image", or "Pipe". This may be
3370 more restrictive than indicated
3371 by "AccQual" to reflect what the
kernel actually does. If not
3373 present then the runtime must
3374 assume what is implied by
3375 "AccQual" and "IsConst". Values
3382 "IsConst" boolean Indicates if the kernel argument
3383 is const qualified. Only present
3387 "IsRestrict" boolean Indicates if the kernel argument
3388 is restrict qualified. Only
3389 present if "ValueKind" is
3392 "IsVolatile" boolean Indicates if the kernel argument
3393 is volatile qualified. Only
3394 present if "ValueKind" is
3397 "IsPipe" boolean Indicates if the kernel argument
3398 is pipe qualified. Only present
3399 if "ValueKind" is "Pipe".
3403 Can GlobalBuffer be pipe
3406 ================= ============== ========= ================================
3410 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
3411 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
3413 ============================ ============== ========= =====================
3414 String Key Value Type Required? Description
3415 ============================ ============== ========= =====================
3416 "KernargSegmentSize" integer Required The size in bytes of
3418 that holds the values
3421 "GroupSegmentFixedSize" integer Required The amount of group
3425 bytes. This does not
3427 dynamically allocated
3428 group segment memory
3432 "PrivateSegmentFixedSize" integer Required The amount of fixed
3433 private address space
3434 memory required for a
3436 bytes. If the kernel
3438 stack then additional
3440 to this value for the
3442 "KernargSegmentAlign" integer Required The maximum byte
3445 kernarg segment. Must
3447 "WavefrontSize" integer Required Wavefront size. Must
3449 "NumSGPRs" integer Required Number of scalar
3453 includes the special
3455 Scratch (GFX7-GFX10)
3457 GFX8-GFX10). It does
3459 SGPR added if a trap
3465 "NumVGPRs" integer Required Number of vector
3469 "MaxFlatWorkGroupSize" integer Required Maximum flat
3472 kernel in work-items.
3475 ReqdWorkGroupSize if
3477 "NumSpilledSGPRs" integer Number of stores from
3478 a scalar register to
3479 a register allocator
3482 "NumSpilledVGPRs" integer Number of stores from
3483 a vector register to
3484 a register allocator
3487 ============================ ============== ========= =====================
3489 .. _amdgpu-amdhsa-code-object-metadata-v3:
3491 Code Object V3 Metadata
3492 +++++++++++++++++++++++
.. warning::

  Code object V3 generation is no longer supported by this version of LLVM.
3497 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3498 record (see :ref:`amdgpu-note-records-v3-onwards`).
3500 The metadata is represented as Message Pack formatted binary data (see
3501 [MsgPack]_). The top level is a Message Pack map that includes the
3502 keys defined in table
3503 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3506 Additional information can be added to the maps. To avoid conflicts,
3507 any key names should be prefixed by "*vendor-name*." where
3508 ``vendor-name`` can be the name of the vendor and specific vendor
3509 tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the same
*vendor-name*.
3513 .. table:: AMDHSA Code Object V3 Metadata Map
3514 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3516 ================= ============== ========= =======================================
3517 String Key Value Type Required? Description
3518 ================= ============== ========= =======================================
3519 "amdhsa.version" sequence of Required - The first integer is the major
3520 2 integers version. Currently 1.
3521 - The second integer is the minor
3522 version. Currently 0.
3523 "amdhsa.printf" sequence of Each string is encoded information
3524 strings about a printf function call. The
3525 encoded information is organized as
3526 fields separated by colon (':'):
3528 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3533 A 32-bit integer as a unique id for
3534 each printf function call
3537 A 32-bit integer equal to the number
3538 of arguments of printf function call
3541 ``S[i]`` (where i = 0, 1, ... , N-1)
3542 32-bit integers for the size in bytes
3543 of the i-th FormatString argument of
3544 the printf function call
3547 The format string passed to the
3548 printf function call.
3549 "amdhsa.kernels" sequence of Required Sequence of the maps for each
3550 map kernel in the code object. See
3551 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3552 for the definition of the keys included
3554 ================= ============== ========= =======================================
3558 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3559 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3561 =================================== ============== ========= ================================
3562 String Key Value Type Required? Description
3563 =================================== ============== ========= ================================
3564 ".name" string Required Source name of the kernel.
3565 ".symbol" string Required Name of the kernel
3566 descriptor ELF symbol.
3567 ".language" string Source language of the kernel.
3577 ".language_version" sequence of - The first integer is the major
3579 - The second integer is the
3581 ".args" sequence of Sequence of maps of the
3582 map kernel arguments. See
3583 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3584 for the definition of the keys
3585 included in that map.
3586 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3587 3 integers must be >=1 and the dispatch
3588 work-group size X, Y, Z must
3589 correspond to the specified
3590 values. Defaults to 0, 0, 0.
3592 Corresponds to the OpenCL
3593 ``reqd_work_group_size``
3595 ".workgroup_size_hint" sequence of The dispatch work-group size
3596 3 integers X, Y, Z is likely to be the
3599 Corresponds to the OpenCL
3600 ``work_group_size_hint``
3602 ".vec_type_hint" string The name of a scalar or vector
3605 Corresponds to the OpenCL
3606 ``vec_type_hint`` attribute.
3608 ".device_enqueue_symbol" string The external symbol name
3609 associated with a kernel.
3610 OpenCL runtime allocates a
3611 global buffer for the symbol
3612 and saves the kernel's address
3613 to it, which is used for
3614 device side enqueueing. Only
3615 available for device side
3617 ".kernarg_segment_size" integer Required The size in bytes of
3619 that holds the values
3622 ".group_segment_fixed_size" integer Required The amount of group
3626 bytes. This does not
3628 dynamically allocated
3629 group segment memory
3633 ".private_segment_fixed_size" integer Required The amount of fixed
3634 private address space
3635 memory required for a
3637 bytes. If the kernel
3639 stack then additional
3641 to this value for the
3643 ".kernarg_segment_align" integer Required The maximum byte
3646 kernarg segment. Must
3648 ".wavefront_size" integer Required Wavefront size. Must
3650 ".sgpr_count" integer Required Number of scalar
3651 registers required by a
3653 GFX6-GFX9. A register
3654 is required if it is
3656 if a higher numbered
3659 includes the special
3665 SGPR added if a trap
3671 ".vgpr_count" integer Required Number of vector
3672 registers required by
3674 GFX6-GFX9. A register
3675 is required if it is
3677 if a higher numbered
3680 ".agpr_count" integer Required Number of accumulator
3681 registers required by
3684 ".max_flat_workgroup_size" integer Required Maximum flat
3687 kernel in work-items.
3690 ReqdWorkGroupSize if
3692 ".sgpr_spill_count" integer Number of stores from
3693 a scalar register to
3694 a register allocator
3697 ".vgpr_spill_count" integer Number of stores from
3698 a vector register to
3699 a register allocator
3702 ".kind" string The kind of the kernel
3710 These kernels must be
3711 invoked after loading
3721 These kernels must be
3724 containing code object
3725 and after all init and
3726 normal kernels in the
3727 same code object have
3731 If omitted, "normal" is
3733 =================================== ============== ========= ================================
3737 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3738 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3740 ====================== ============== ========= ================================
3741 String Key Value Type Required? Description
3742 ====================== ============== ========= ================================
3743 ".name" string Kernel argument name.
3744 ".type_name" string Kernel argument type name.
3745 ".size" integer Required Kernel argument size in bytes.
3746 ".offset" integer Required Kernel argument offset in
3747 bytes. The offset must be a
3748 multiple of the alignment
3749 required by the argument.
3750 ".value_kind" string Required Kernel argument kind that
3751 specifies how to set up the
3752 corresponding argument.
3756 The argument is copied
3757 directly into the kernarg.
3760 A global address space pointer
3761 to the buffer data is passed
3764 "dynamic_shared_pointer"
3765 A group address space pointer
3766 to dynamically allocated LDS
3767 is passed in the kernarg.
3770 A global address space
3771 pointer to a S# is passed in
3775 A global address space
3776 pointer to a T# is passed in
3780 A global address space pointer
3781 to an OpenCL pipe is passed in
3785 A global address space pointer
3786 to an OpenCL device enqueue
3787 queue is passed in the
3790 "hidden_global_offset_x"
3791 The OpenCL grid dispatch
3792 global offset for the X
3793 dimension is passed in the
3796 "hidden_global_offset_y"
3797 The OpenCL grid dispatch
3798 global offset for the Y
3799 dimension is passed in the
3802 "hidden_global_offset_z"
3803 The OpenCL grid dispatch
3804 global offset for the Z
3805 dimension is passed in the
3809 An argument that is not used
3810 by the kernel. Space needs to
3811 be left for it, but it does
3812 not need to be set up.
3814 "hidden_printf_buffer"
3815 A global address space pointer
3816 to the runtime printf buffer
is passed in kernarg. Mutually
exclusive with
"hidden_hostcall_buffer"
before Code Object V5.
3822 "hidden_hostcall_buffer"
3823 A global address space pointer
3824 to the runtime hostcall buffer
is passed in kernarg. Mutually
exclusive with
"hidden_printf_buffer"
before Code Object V5.
3830 "hidden_default_queue"
3831 A global address space pointer
3832 to the OpenCL device enqueue
3833 queue that should be used by
3834 the kernel by default is
3835 passed in the kernarg.
3837 "hidden_completion_action"
3838 A global address space pointer
3839 to help link enqueued kernels into
3840 the ancestor tree for determining
3841 when the parent kernel has finished.
3843 "hidden_multigrid_sync_arg"
3844 A global address space pointer for
3845 multi-grid synchronization is
3846 passed in the kernarg.
3848 ".value_type" string Unused and deprecated. This should no longer
3849 be emitted, but is accepted for compatibility.
3851 ".pointee_align" integer Alignment in bytes of pointee
3852 type for pointer type kernel
3853 argument. Must be a power
3854 of 2. Only present if
3856 "dynamic_shared_pointer".
3857 ".address_space" string Kernel argument address space
3858 qualifier. Only present if
3859 ".value_kind" is "global_buffer" or
3860 "dynamic_shared_pointer". Values
3872 Is "global_buffer" only "global"
3874 "dynamic_shared_pointer" always
3875 "local"? Can HCC allow "generic"?
3876 How can "private" or "region"
3879 ".access" string Kernel argument access
3880 qualifier. Only present if
3881 ".value_kind" is "image" or
3894 ".actual_access" string The actual memory accesses
3895 performed by the kernel on the
3896 kernel argument. Only present if
3897 ".value_kind" is "global_buffer",
3898 "image", or "pipe". This may be
3899 more restrictive than indicated
3900 by ".access" to reflect what the
kernel actually does. If not
3902 present then the runtime must
3903 assume what is implied by
3904 ".access" and ".is_const" . Values
3911 ".is_const" boolean Indicates if the kernel argument
3912 is const qualified. Only present
3916 ".is_restrict" boolean Indicates if the kernel argument
3917 is restrict qualified. Only
3918 present if ".value_kind" is
3921 ".is_volatile" boolean Indicates if the kernel argument
3922 is volatile qualified. Only
3923 present if ".value_kind" is
3926 ".is_pipe" boolean Indicates if the kernel argument
3927 is pipe qualified. Only present
3928 if ".value_kind" is "pipe".
3932 Can "global_buffer" be pipe
3935 ====================== ============== ========= ================================
3937 .. _amdgpu-amdhsa-code-object-metadata-v4:
3939 Code Object V4 Metadata
3940 +++++++++++++++++++++++
3942 Code object V4 metadata is the same as
3943 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3944 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3946 .. table:: AMDHSA Code Object V4 Metadata Map Changes
3947 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3949 ================= ============== ========= =======================================
3950 String Key Value Type Required? Description
3951 ================= ============== ========= =======================================
3952 "amdhsa.version" sequence of Required - The first integer is the major
3953 2 integers version. Currently 1.
3954 - The second integer is the minor
3955 version. Currently 1.
3956 "amdhsa.target" string Required The target name of the code using the syntax:
3960 <target-triple> [ "-" <target-id> ]
3962 A canonical target ID must be
3963 used. See :ref:`amdgpu-target-triples`
3964 and :ref:`amdgpu-target-id`.
3965 ================= ============== ========= =======================================
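For example, a plausible ``amdhsa.target`` value for a ``gfx908`` code object
built for the ``amdhsa`` OS with both of its target features specified might be
(the feature settings shown are purely illustrative)::

  "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-"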
3967 .. _amdgpu-amdhsa-code-object-metadata-v5:
3969 Code Object V5 Metadata
3970 +++++++++++++++++++++++
.. warning::

  Code object V5 is not the default code object version emitted by this version
  of LLVM.
3977 Code object V5 metadata is the same as
3978 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3979 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
3980 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
3981 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3983 .. table:: AMDHSA Code Object V5 Metadata Map Changes
3984 :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3986 ================= ============== ========= =======================================
3987 String Key Value Type Required? Description
3988 ================= ============== ========= =======================================
3989 "amdhsa.version" sequence of Required - The first integer is the major
3990 2 integers version. Currently 1.
3991 - The second integer is the minor
3992 version. Currently 2.
3993 ================= ============== ========= =======================================
3997 .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
3998 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
4000 ============================= ============= ========== =======================================
4001 String Key Value Type Required? Description
4002 ============================= ============= ========== =======================================
4003 ".uses_dynamic_stack" boolean Indicates if the generated machine code
4004 is using a dynamically sized stack.
4005 ".workgroup_processor_mode" boolean (GFX10+) Controls ENABLE_WGP_MODE in
4006 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4007 ============================= ============= ========== =======================================
4011 .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
4012 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
4014 =========================== ============== ========= ==============================
4015 String Key Value Type Required? Description
4016 =========================== ============== ========= ==============================
4017 ".uniform_work_group_size" integer Indicates if the kernel
4018 requires that each dimension
4019 of global size is a multiple
4020 of corresponding dimension of
4021 work-group size. Value of 1
4022 implies true and value of 0
4023 implies false. Metadata is
4024 only emitted when value is 1.
4025 =========================== ============== ========= ==============================
4031 .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
4032 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
4034 ====================== ============== ========= ================================
4035 String Key Value Type Required? Description
4036 ====================== ============== ========= ================================
4037 ".value_kind" string Required Kernel argument kind that
4038 specifies how to set up the
4039 corresponding argument.
4041 the same as code object V3 metadata
4042 (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
4043 with the following additions:
4045 "hidden_block_count_x"
4046 The grid dispatch work-group count for the X dimension
4047 is passed in the kernarg. Some languages, such as OpenCL,
4048 support a last work-group in each dimension being partial.
4049 This count only includes the non-partial work-group count.
4050 This is not the same as the value in the AQL dispatch packet,
4051 which has the grid size in work-items.
4053 "hidden_block_count_y"
4054 The grid dispatch work-group count for the Y dimension
4055 is passed in the kernarg. Some languages, such as OpenCL,
4056 support a last work-group in each dimension being partial.
4057 This count only includes the non-partial work-group count.
4058 This is not the same as the value in the AQL dispatch packet,
4059 which has the grid size in work-items. If the grid dimensionality
4060 is 1, then must be 1.
4062 "hidden_block_count_z"
4063 The grid dispatch work-group count for the Z dimension
4064 is passed in the kernarg. Some languages, such as OpenCL,
4065 support a last work-group in each dimension being partial.
4066 This count only includes the non-partial work-group count.
4067 This is not the same as the value in the AQL dispatch packet,
4068 which has the grid size in work-items. If the grid dimensionality
4069 is 1 or 2, then must be 1.
4071 "hidden_group_size_x"
4072 The grid dispatch work-group size for the X dimension is
4073 passed in the kernarg. This size only applies to the
4074 non-partial work-groups. This is the same value as the AQL
4075 dispatch packet work-group size.
4077 "hidden_group_size_y"
4078 The grid dispatch work-group size for the Y dimension is
4079 passed in the kernarg. This size only applies to the
4080 non-partial work-groups. This is the same value as the AQL
4081 dispatch packet work-group size. If the grid dimensionality
4082 is 1, then must be 1.
4084 "hidden_group_size_z"
4085 The grid dispatch work-group size for the Z dimension is
4086 passed in the kernarg. This size only applies to the
4087 non-partial work-groups. This is the same value as the AQL
4088 dispatch packet work-group size. If the grid dimensionality
4089 is 1 or 2, then must be 1.
4091 "hidden_remainder_x"
4092 The grid dispatch work group size of the partial work group
4093 of the X dimension, if it exists. Must be zero if a partial
4094 work group does not exist in the X dimension.
4096 "hidden_remainder_y"
4097 The grid dispatch work group size of the partial work group
4098 of the Y dimension, if it exists. Must be zero if a partial
4099 work group does not exist in the Y dimension.
4101 "hidden_remainder_z"
4102 The grid dispatch work group size of the partial work group
4103 of the Z dimension, if it exists. Must be zero if a partial
4104 work group does not exist in the Z dimension.
"hidden_grid_dims"
The grid dispatch dimensionality. This is the same value
as the AQL dispatch packet dimensionality. Must be a value
between 1 and 3.
"hidden_heap_v1"
A global address space pointer to an initialized memory
4113 buffer that conforms to the requirements of the malloc/free
4114 device library V1 version implementation.
4116 "hidden_private_base"
4117 The high 32 bits of the flat addressing private aperture base.
4118 Only used by GFX8 to allow conversion between private segment
4119 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4121 "hidden_shared_base"
4122 The high 32 bits of the flat addressing shared aperture base.
4123 Only used by GFX8 to allow conversion between shared segment
4124 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
"hidden_queue_ptr"
A global memory address space pointer to the ROCm runtime
4128 ``struct amd_queue_t`` structure for the HSA queue of the
4129 associated dispatch AQL packet. It is only required for pre-GFX9
4130 devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
4132 ====================== ============== ========= ================================
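The relationship between the grid size in work-items, the non-partial
work-group count, and the remainder described by these hidden kernel arguments
can be illustrated for one dimension with a small C++ sketch (illustrative
names, not part of any runtime interface)::

  #include <cstdint>

  struct HiddenDimArgs {
    uint32_t block_count;  // hidden_block_count_*: non-partial work-groups only
    uint32_t group_size;   // hidden_group_size_*: non-partial work-group size
    uint32_t remainder;    // hidden_remainder_*: trailing partial group, or 0
  };

  // Derive the hidden kernarg values for one dimension from the dispatch grid
  // size (in work-items) and the work-group size.
  HiddenDimArgs computeHiddenDimArgs(uint32_t grid_size, uint32_t group_size) {
    HiddenDimArgs args;
    args.block_count = grid_size / group_size;  // partial group not counted
    args.group_size = group_size;
    args.remainder = grid_size % group_size;    // 0 if no partial group exists
    return args;
  }

  // Example: a 1000 work-item dispatch with 256 work-item work-groups gives
  // block_count = 3, group_size = 256, remainder = 232.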
Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
4140 that can be used to control the dispatch of kernels, in an agent independent
4141 way. An agent can have zero or more AQL queues created for it using an HSA
4142 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
4143 are 64 bytes) can be placed. See the *HSA Platform System Architecture
4144 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
4146 The packet processor of a kernel agent is responsible for detecting and
4147 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
4148 packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).
4152 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
4153 the kernel mode driver to initialize and register the AQL queue with CP.
To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU. A condensed sketch
of the host-side packet construction using the HSA runtime C API follows the
list.
4158 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
4159 executed is obtained.
4160 2. A pointer to the kernel descriptor (see
4161 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
4162 It must be for a kernel that is contained in a code object that was loaded
by an HSA compatible runtime on the kernel agent with which the AQL queue is
associated.
4165 3. Space is allocated for the kernel arguments using the HSA compatible runtime
4166 allocator for a memory region with the kernarg property for the kernel agent
4167 that will execute the kernel. It must be at least 16-byte aligned.
4168 4. Kernel argument values are assigned to the kernel argument memory
4169 allocation. The layout is defined in the *HSA Programmer's Language
4170 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
4171 kernel argument memory in the same way constant memory is accessed. (Note
4172 that the HSA specification allows an implementation to copy the kernel
4173 argument contents to another location that is accessed by the kernel.)
4174 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
4175 runtime api uses 64-bit atomic operations to reserve space in the AQL queue
4176 for the packet. The packet must be set up, and the final write must use an
4177 atomic store release to set the packet kind to ensure the packet contents are
4178 visible to the kernel agent. AQL defines a doorbell signal mechanism to
4179 notify the kernel agent that the AQL queue has been updated. These rules, and
4180 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
4181 System Architecture Specification* [HSA]_.
4182 6. A kernel dispatch packet includes information about the actual dispatch,
4183 such as grid and work-group size, together with information from the code
4184 object about the kernel, such as segment sizes. The HSA compatible runtime
4185 queries on the kernel symbol can be used to obtain the code object values
4186 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
4187 7. CP executes micro-code and is responsible for detecting and setting up the
4188 GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
4190 code, the scalar general purpose registers (SGPR) and vector general purpose
4191 registers (VGPR) are set up as required by the machine code. The required
4192 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
4193 register state is defined in
4194 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4195 9. The prolog of the kernel machine code (see
4196 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
4197 before continuing executing the machine code that corresponds to the kernel.
4198 10. When the kernel dispatch has completed execution, CP signals the completion
4199 signal specified in the kernel dispatch packet if not 0.
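A condensed sketch of steps 5 and 6 using the HSA runtime C API is shown below.
It omits error checking, hard-codes an illustrative one-dimensional grid, and
assumes the queue, loaded kernel descriptor address (``kernel_object``),
16-byte aligned kernarg allocation, segment sizes, and completion signal have
already been obtained as described above::

  #include <cstdint>
  #include <cstring>
  #include <hsa/hsa.h>

  void dispatch(hsa_queue_t *queue, uint64_t kernel_object,
                void *kernarg_address, uint32_t private_segment_size,
                uint32_t group_segment_size, hsa_signal_t completion_signal) {
    // Step 5: reserve a packet slot with a 64-bit atomic add of the write index
    // and wait until the packet processor has consumed enough older packets.
    uint64_t packet_id = hsa_queue_add_write_index_screlease(queue, 1);
    while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
      ;

    hsa_kernel_dispatch_packet_t *packet =
        reinterpret_cast<hsa_kernel_dispatch_packet_t *>(queue->base_address) +
        (packet_id % queue->size);

    // Step 6: fill in the dispatch information and the code object values.
    // The header word is written last, so clear everything after it first.
    memset((char *)packet + sizeof(packet->header), 0,
           sizeof(*packet) - sizeof(packet->header));
    packet->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;  // 1D
    packet->workgroup_size_x = 256;
    packet->workgroup_size_y = 1;
    packet->workgroup_size_z = 1;
    packet->grid_size_x = 1024;
    packet->grid_size_y = 1;
    packet->grid_size_z = 1;
    packet->private_segment_size = private_segment_size;
    packet->group_segment_size = group_segment_size;
    packet->kernel_object = kernel_object;
    packet->kernarg_address = kernarg_address;
    packet->completion_signal = completion_signal;

    // Publish the packet with an atomic release store of the header so the
    // packet processor sees fully initialized contents, then ring the doorbell.
    uint16_t header =
        (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
    __atomic_store_n(&packet->header, header, __ATOMIC_RELEASE);
    hsa_signal_store_screlease(queue->doorbell_signal,
                               (hsa_signal_value_t)packet_id);
  }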
4201 .. _amdgpu-amdhsa-memory-spaces:
Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:
4208 .. table:: AMDHSA Memory Spaces
4209 :name: amdgpu-amdhsa-memory-spaces-table
================= =========== ======== ======= ==================
Memory Space Name HSA Segment Hardware Address NULL Value
                  Name        Name     Size
================= =========== ======== ======= ==================
Private           private     scratch  32      0x00000000
Local             group       LDS      32      0xFFFFFFFF
Global            global      global   64      0x0000000000000000
Constant          constant    *same as 64      0x0000000000000000
                              global*
Generic           flat        flat     64      0x0000000000000000
Region            N/A         GDS      32      *not implemented
                                               for AMDHSA*
================= =========== ======== ======= ==================
4225 The global and constant memory spaces both use global virtual addresses, which
4226 are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some accessible by both the CPU and GPU.
4230 Using the constant memory space indicates that the data will not change during
4231 the execution of the kernel. This allows scalar read instructions to be
4232 used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatch executions.
4236 The local memory space uses the hardware Local Data Store (LDS) which is
4237 automatically allocated when the hardware creates work-groups of wavefronts, and
4238 freed when all the wavefronts of a work-group have terminated. The data store
4239 (DS) instructions can be used to access it.
4241 The private memory space uses the hardware scratch memory support. If the kernel
4242 uses scratch, then the hardware allocates memory that is accessed using
4243 wavefront lane dword (4 byte) interleaving. The mapping used from private
4244 address to physical address is:
4246 ``wavefront-scratch-base +
4247 (private-address * wavefront-size * 4) +
4248 (wavefront-lane-id * 4)``
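A direct transcription of this mapping into code (an illustrative helper, not
part of any LLVM or runtime interface) is::

  #include <cstdint>

  // Direct transcription of the mapping above: returns the scratch memory
  // address accessed by one lane at the given private address.
  uint64_t privateToScratchAddress(uint64_t wavefront_scratch_base,
                                   uint64_t private_address,
                                   uint32_t wavefront_size,  // 32 or 64
                                   uint32_t wavefront_lane_id) {
    return wavefront_scratch_base +
           private_address * wavefront_size * 4 +
           wavefront_lane_id * 4;
  }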
4250 There are different ways that the wavefront scratch base address is determined
4251 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
4252 memory can be accessed in an interleaved manner using buffer instruction with
4253 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
4254 instructions, or by flat instructions. If each lane of a wavefront accesses the
4255 same private address, the interleaving results in adjacent dwords being accessed
4256 and hence requires fewer cache lines to be fetched. Multi-dword access is not
4257 supported except by flat and scratch instructions in GFX9-GFX11.
4259 The generic address space uses the hardware flat address support available in
4260 GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures) that are outside the range of addressable global memory, to
4262 map from a flat address to a private or local address.
4264 FLAT instructions can take a flat address and access global, private (scratch)
4265 and group (LDS) memory depending on if the address is within one of the
4266 aperture ranges. Flat access to scratch requires hardware aperture setup and
4267 setup in the kernel prologue (see
4268 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
4269 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
4270 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
4272 To convert between a segment address and a flat address the base address of the
4273 apertures address can be used. For GFX7-GFX8 these are available in the
4274 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
4275 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
4276 GFX9-GFX11 the aperture base addresses are directly available as inline constant
4277 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
4278 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
4279 which makes it easier to convert from flat to segment or segment to flat.
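For example, because the apertures are 2^32 bytes in size and 2^32 aligned, a
group (LDS) segment address can be converted to and from a flat address by
manipulating only the high 32 bits. A hedged sketch follows (the aperture base
would be read from the queue fields or ``SRC_SHARED_BASE`` as described above)::

  #include <cstdint>

  // Group (LDS) segment address to flat address: the 2^32 aligned shared
  // aperture base supplies the high half of the 64-bit flat address.
  uint64_t groupToFlat(uint64_t shared_aperture_base, uint32_t group_address) {
    return shared_aperture_base | group_address;
  }

  // Flat address to group segment address: valid only if the flat address
  // falls inside the shared aperture; the low 32 bits are the segment address.
  bool flatToGroup(uint64_t shared_aperture_base, uint64_t flat_address,
                   uint32_t *group_address) {
    if ((flat_address & ~0xFFFFFFFFull) != shared_aperture_base)
      return false;
    *group_address = static_cast<uint32_t>(flat_address);
    return true;
  }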
Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sample handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
4286 object respectively. In order to support the HSA ``query_sampler`` operations
4287 two extra dwords are used to store the HSA BRIG enumeration values for the
4288 queries that are not trivially deducible from the S# representation.
HSA Signals
~~~~~~~~~~~

HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
4294 are 64-bit addresses of a structure allocated in memory accessible from both the
4295 CPU and GPU. The structure is defined by the runtime and subject to change
4296 between releases. For example, see [AMD-ROCm-github]_.
4298 .. _amdgpu-amdhsa-hsa-aql-queue:
HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by an HSA compatible runtime (see
4304 :ref:`amdgpu-os`) and subject to change between releases. For example, see
4305 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
4306 certain language features such as the flat address aperture bases. It also
4307 contains fields used by CP such as managing the allocation of scratch memory.
4309 .. _amdgpu-amdhsa-kernel-descriptor:
Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
4315 execution of a kernel, including the entry point address of the machine code
4316 that implements the kernel.
4318 Code Object V3 Kernel Descriptor
4319 ++++++++++++++++++++++++++++++++
CP microcode requires the Kernel descriptor to be allocated on 64-byte
alignment.
4324 The fields used by CP for code objects before V3 also match those specified in
4325 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4327 .. table:: Code Object V3 Kernel Descriptor
4328 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
4330 ======= ======= =============================== ============================
4331 Bits Size Field Name Description
4332 ======= ======= =============================== ============================
4333 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
4334 address space memory
4335 required for a work-group
4336 in bytes. This does not
4337 include any dynamically
4338 allocated local address
4339 space memory that may be
4340 added when the kernel is
4342 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
4343 private address space
4344 memory required for a
4345 work-item in bytes. When
4346 this cannot be predicted,
4347 code object v4 and older
4348 sets this value to be
4349 higher than the minimum
4351 95:64 4 bytes KERNARG_SIZE The size of the kernarg
4352 memory pointed to by the
4353 AQL dispatch packet. The
4354 kernarg memory is used to
4355 pass arguments to the
4358 * If the kernarg pointer in
4359 the dispatch packet is NULL
4360 then there are no kernel
4362 * If the kernarg pointer in
4363 the dispatch packet is
4364 not NULL and this value
4365 is 0 then the kernarg
4368 * If the kernarg pointer in
4369 the dispatch packet is
4370 not NULL and this value
4371 is not 0 then the value
4372 specifies the kernarg
4373 memory size in bytes. It
4374 is recommended to provide
4375 a value as it may be used
4376 by CP to optimize making
4378 visible to the kernel
4381 127:96 4 bytes Reserved, must be 0.
191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
negative) from base address of
kernel descriptor to kernel's
entry point instruction
which must be 256 byte
aligned.
351:192 20 bytes Reserved, must be 0.
4391 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
4392 Reserved, must be 0.
4395 program settings used by
4397 ``COMPUTE_PGM_RSRC3``
4400 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
4403 program settings used by
4405 ``COMPUTE_PGM_RSRC3``
4408 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx12-table`.
4409 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
4410 program settings used by
4412 ``COMPUTE_PGM_RSRC1``
4415 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
4416 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
4417 program settings used by
4419 ``COMPUTE_PGM_RSRC2``
4422 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
4423 458:448 7 bits *See separate bits below.* Enable the setup of the
4424 SGPR user data registers
4426 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4428 The total number of SGPR
4430 requested must not exceed
4431 16 and match value in
4432 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
4433 Any requests beyond 16
>448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER If the *Target Properties*
4437 :ref:`amdgpu-processor-table`
4438 specifies *Architected flat
4439 scratch* then not supported
4441 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
4442 >450 1 bit ENABLE_SGPR_QUEUE_PTR
4443 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
4444 >452 1 bit ENABLE_SGPR_DISPATCH_ID
4445 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
4447 :ref:`amdgpu-processor-table`
4448 specifies *Architected flat
4449 scratch* then not supported
>454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT_SIZE
4453 457:455 3 bits Reserved, must be 0.
4454 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
4455 Reserved, must be 0.
4458 wavefront size 64 mode.
4460 native wavefront size
4462 459 1 bit USES_DYNAMIC_STACK Indicates if the generated
4463 machine code is using a
4464 dynamically sized stack.
4465 This is only set in code
4466 object v5 and later.
4467 463:460 4 bits Reserved, must be 0.
4468 470:464 7 bits KERNARG_PRELOAD_SPEC_LENGTH GFX6-GFX9
4469 - Reserved, must be 0.
4471 - The number of dwords from
4472 the kernarg segment to preload
4473 into User SGPRs before kernel
4475 :ref:`amdgpu-amdhsa-kernarg-preload`).
4476 479:471 9 bits KERNARG_PRELOAD_SPEC_OFFSET GFX6-GFX9
4477 - Reserved, must be 0.
4479 - An offset in dwords into the
4480 kernarg segment to begin
4481 preloading data into User
4483 :ref:`amdgpu-amdhsa-kernarg-preload`).
4484 511:480 4 bytes Reserved, must be 0.
4485 512 **Total size 64 bytes.**
4486 ======= ====================================================================
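The 64-byte layout above can be mirrored by a plain C++ struct. The following
is an illustrative sketch (LLVM provides a similar definition in
``llvm/Support/AMDHSAKernelDescriptor.h``; the field names used here are not
authoritative)::

  #include <cstdint>

  struct KernelDescriptor {
    uint32_t group_segment_fixed_size;       // bits 31:0
    uint32_t private_segment_fixed_size;     // bits 63:32
    uint32_t kernarg_size;                   // bits 95:64
    uint8_t  reserved0[4];                   // bits 127:96
    int64_t  kernel_code_entry_byte_offset;  // bits 191:128, signed offset
    uint8_t  reserved1[20];                  // bits 351:192
    uint32_t compute_pgm_rsrc3;              // bits 383:352
    uint32_t compute_pgm_rsrc1;              // bits 415:384
    uint32_t compute_pgm_rsrc2;              // bits 447:416
    uint16_t kernel_code_properties;         // bits 463:448, enable/mode bits
    uint16_t kernarg_preload;                // bits 479:464, preload spec
    uint8_t  reserved2[4];                   // bits 511:480
  };
  static_assert(sizeof(KernelDescriptor) == 64,
                "kernel descriptor must be 64 bytes");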
4490 .. table:: compute_pgm_rsrc1 for GFX6-GFX12
4491 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table
4493 ======= ======= =============================== ===========================================================================
4494 Bits Size Field Name Description
4495 ======= ======= =============================== ===========================================================================
4496 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
4497 blocks used by each work-item;
4498 granularity is device
4503 - max(0, ceil(vgprs_used / 4) - 1)
4506 - vgprs_used = align(arch_vgprs, 4)
4508 - max(0, ceil(vgprs_used / 8) - 1)
4509 GFX10-GFX11 (wavefront size 64)
4511 - max(0, ceil(vgprs_used / 4) - 1)
4512 GFX10-GFX11 (wavefront size 32)
4514 - max(0, ceil(vgprs_used / 8) - 1)
4516 Where vgprs_used is defined
4517 as the highest VGPR number
4518 explicitly referenced plus
4521 Used by CP to set up
4522 ``COMPUTE_PGM_RSRC1.VGPRS``.
4525 :ref:`amdgpu-assembler`
4527 automatically for the
4528 selected processor from
4529 values provided to the
4530 `.amdhsa_kernel` directive
4532 `.amdhsa_next_free_vgpr`
4533 nested directive (see
4534 :ref:`amdhsa-kernel-directives-table`).
4535 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4536 blocks used by a wavefront;
4537 granularity is device
4542 - max(0, ceil(sgprs_used / 8) - 1)
4545 - 2 * max(0, ceil(sgprs_used / 16) - 1)
4547 Reserved, must be 0.
4552 defined as the highest
4553 SGPR number explicitly
4554 referenced plus one, plus
4555 a target specific number
4556 of additional special
4558 FLAT_SCRATCH (GFX7+) and
4559 XNACK_MASK (GFX8+), and
4562 limitations. It does not
4563 include the 16 SGPRs added
4564 if a trap handler is
4568 limitations and special
4569 SGPR layout are defined in
4571 documentation, which can
4573 :ref:`amdgpu-processors`
4576 Used by CP to set up
4577 ``COMPUTE_PGM_RSRC1.SGPRS``.
4580 :ref:`amdgpu-assembler`
4582 automatically for the
4583 selected processor from
4584 values provided to the
4585 `.amdhsa_kernel` directive
4587 `.amdhsa_next_free_sgpr`
4588 and `.amdhsa_reserve_*`
4589 nested directives (see
4590 :ref:`amdhsa-kernel-directives-table`).
4591 11:10 2 bits PRIORITY Must be 0.
4593 Start executing wavefront
4594 at the specified priority.
4596 CP is responsible for
4598 ``COMPUTE_PGM_RSRC1.PRIORITY``.
4599 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
4600 with specified rounding
4603 precision floating point
4606 Floating point rounding
4607 mode values are defined in
4608 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4610 Used by CP to set up
4611 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4612 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
4613 with specified rounding
4614 denorm mode for half/double (16
4615 and 64-bit) floating point
4616 precision floating point
4619 Floating point rounding
4620 mode values are defined in
4621 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4623 Used by CP to set up
4624 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4625 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
4626 with specified denorm mode
4629 precision floating point
4632 Floating point denorm mode
4633 values are defined in
4634 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4636 Used by CP to set up
4637 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4638 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
4639 with specified denorm mode
4641 and 64-bit) floating point
4642 precision floating point
4645 Floating point denorm mode
4646 values are defined in
4647 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4649 Used by CP to set up
4650 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4651 20 1 bit PRIV Must be 0.
4653 Start executing wavefront
4654 in privilege trap handler
4657 CP is responsible for
4659 ``COMPUTE_PGM_RSRC1.PRIV``.
4660 21 1 bit ENABLE_DX10_CLAMP GFX9-GFX11
4661 Wavefront starts execution
4662 with DX10 clamp mode
4663 enabled. Used by the vector
4664 ALU to force DX10 style
4665 treatment of NaN's (when
4666 set, clamp NaN to zero,
4670 Used by CP to set up
4671 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4673 If 1, wavefronts are scheduled
4674 in a round-robin fashion with
4675 respect to the other wavefronts
4676 of the SIMD. Otherwise, wavefronts
4677 are scheduled in oldest age order.
4679 CP is responsible for filling in
4680 ``COMPUTE_PGM_RSRC1.WG_RR_EN``.
4681 22 1 bit DEBUG_MODE Must be 0.
4683 Start executing wavefront
4684 in single step mode.
4686 CP is responsible for
4688 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4689 23 1 bit ENABLE_IEEE_MODE GFX9-GFX11
4690 Wavefront starts execution
4692 enabled. Floating point
4693 opcodes that support
4694 exception flag gathering
4695 will quiet and propagate
4696 signaling-NaN inputs per
4697 IEEE 754-2008. Min_dx10 and
4698 max_dx10 become IEEE
4699 754-2008 compliant due to
4700 signaling-NaN propagation
4703 Used by CP to set up
4704 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4706 Reserved. Must be 0.
4707 24 1 bit BULKY Must be 0.
4709 Only one work-group allowed
4710 to execute on a compute
4713 CP is responsible for
4715 ``COMPUTE_PGM_RSRC1.BULKY``.
4716 25 1 bit CDBG_USER Must be 0.
4718 Flag that can be used to
4719 control debugging code.
4721 CP is responsible for
4723 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4724 26 1 bit FP16_OVFL GFX6-GFX8
4725 Reserved, must be 0.
4727 Wavefront starts execution
4728 with specified fp16 overflow
4731 - If 0, fp16 overflow generates
4733 - If 1, fp16 overflow that is the
4734 result of an +/-INF input value
4735 or divide by 0 produces a +/-INF,
4736 otherwise clamps computed
4737 overflow to +/-MAX_FP16 as
4740 Used by CP to set up
4741 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4742 28:27 2 bits Reserved, must be 0.
4743 29 1 bit WGP_MODE GFX6-GFX9
4744 Reserved, must be 0.
4746 - If 0 execute work-groups in
4747 CU wavefront execution mode.
4748 - If 1 execute work-groups on
4749 in WGP wavefront execution mode.
4751 See :ref:`amdgpu-amdhsa-memory-model`.
4753 Used by CP to set up
4754 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4755 30 1 bit MEM_ORDERED GFX6-GFX9
4756 Reserved, must be 0.
4758 Controls the behavior of the
4759 s_waitcnt's vmcnt and vscnt
4762 - If 0 vmcnt reports completion
4763 of load and atomic with return
4764 out of order with sample
4765 instructions, and the vscnt
4766 reports the completion of
4767 store and atomic without
4769 - If 1 vmcnt reports completion
4770 of load, atomic with return
4771 and sample instructions in
4772 order, and the vscnt reports
4773 the completion of store and
4774 atomic without return in order.
4776 Used by CP to set up
4777 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4778 31 1 bit FWD_PROGRESS GFX6-GFX9
4779 Reserved, must be 0.
4781 - If 0 execute SIMD wavefronts
4782 using oldest first policy.
4783 - If 1 execute SIMD wavefronts to
4784 ensure wavefronts will make some
4787 Used by CP to set up
4788 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4789 32 **Total size 4 bytes**
4790 ======= ===================================================================================================================
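As a worked illustration of the granulated register count encodings above (the
assembler normally derives these automatically from ``.amdhsa_next_free_vgpr``
and ``.amdhsa_next_free_sgpr``), a minimal helper might look like::

  #include <cstdint>

  // Encode a used-register count as "max(0, ceil(regs_used / granule) - 1)",
  // the block-granulated form used by the GRANULATED_*_COUNT fields.
  uint32_t granulated(uint32_t regs_used, uint32_t granule) {
    if (regs_used == 0)
      return 0;
    return (regs_used + granule - 1) / granule - 1;
  }

  // Examples:
  //   GFX9 work-item VGPRs, 42 used, granule 4:  granulated(42, 4)      == 10
  //   GFX9 wavefront SGPRs, 38 used:             2 * granulated(38, 16) == 4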
4794 .. table:: compute_pgm_rsrc2 for GFX6-GFX12
4795 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table
4797 ======= ======= =============================== ===========================================================================
4798 Bits Size Field Name Description
4799 ======= ======= =============================== ===========================================================================
4800 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
4802 * If the *Target Properties*
4804 :ref:`amdgpu-processor-table`
4807 scratch* then enable the
4809 wavefront scratch offset
4810 system register (see
4811 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4812 * If the *Target Properties*
4814 :ref:`amdgpu-processor-table`
4815 specifies *Architected
4816 flat scratch* then enable
4818 FLAT_SCRATCH register
4820 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4822 Used by CP to set up
4823 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4824 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
4826 registers requested. This
4827 number must be greater than
4828 or equal to the number of user
4829 data registers enabled.
4831 Used by CP to set up
4832 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4833 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
4836 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4837 which is set by the CP if
4838 the runtime has installed a
4840 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
4841 system SGPR register for
4842 the work-group id in the X
4844 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4846 Used by CP to set up
4847 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4848 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
4849 system SGPR register for
4850 the work-group id in the Y
4852 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4854 Used by CP to set up
4855 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4856 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
4857 system SGPR register for
4858 the work-group id in the Z
4860 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4862 Used by CP to set up
4863 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4864 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
4865 system SGPR register for
4866 work-group information (see
4867 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4869 Used by CP to set up
4870 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4871 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
4872 VGPR system registers used
4873 for the work-item ID.
4874 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4877 Used by CP to set up
4878 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4879 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
4881 Wavefront starts execution
4883 exceptions enabled which
4884 are generated when L1 has
4885 witnessed a thread access
4889 CP is responsible for
4890 filling in the address
4892 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4893 according to what the
4895 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
4897 Wavefront starts execution
4898 with memory violation
4899 exceptions
4900 enabled which are generated
4901 when a memory violation has
4902 occurred for this wavefront from
4904 (write-to-read-only-memory,
4905 mis-aligned atomic, LDS
4906 address out of range,
4907 illegal address, etc.).
4911 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4912 according to what the
4914 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
4916 CP uses the rounded value
4917 from the dispatch packet,
4918 not this value, as the
4919 dispatch may contain
4920 dynamically allocated group
4921 segment memory. CP writes
4923 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4925 Amount of group segment
4926 (LDS) to allocate for each
4927 work-group. Granularity is
4931 roundup(lds-size / (64 * 4))
4933 roundup(lds-size / (128 * 4))
4935 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
4936 _INVALID_OPERATION with specified exceptions
4939 Used by CP to set up
4940 ``COMPUTE_PGM_RSRC2.EXCP_EN``
4941 (set from bits 0..6).
4945 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
4946 _SOURCE input operands is a denormal number.
4948 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
4949 _DIVISION_BY_ZERO Zero
4950 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
4952 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
4954 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
4956 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
4957 _ZERO (rcp_iflag_f32 instruction only).
4959 31 1 bit Reserved, must be 0.
4960 32 **Total size 4 bytes.**
4961 ======= ===================================================================================================================
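
For illustration, a minimal sketch of the GRANULATED_LDS_SIZE roundup quoted in
the table above. The helper name is hypothetical, and the assumption that the
64 * 4 byte granule applies to GFX6 and the 128 * 4 byte granule to later
targets follows the abbreviated description in the table; CP derives this value
from the dispatch packet, so the kernel descriptor field itself must be 0.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical helper illustrating the roundup CP performs for
  // COMPUTE_PGM_RSRC2.LDS_SIZE (the GRANULATED_LDS_SIZE field must be 0).
  uint32_t granulatedLdsSize(uint32_t LdsBytes, bool IsGfx6) {
    // Assumed granule: 64 * 4 bytes on GFX6, 128 * 4 bytes on later targets.
    uint32_t Granule = IsGfx6 ? 64 * 4 : 128 * 4;
    return (LdsBytes + Granule - 1) / Granule; // roundup(lds-size / granule)
  }
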
4965 .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4966 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4968 ======= ======= =============================== ===========================================================================
4969 Bits Size Field Name Description
4970 ======= ======= =============================== ===========================================================================
4971 5:0 6 bits ACCUM_OFFSET Offset of the first AccVGPR in the unified register file. Granularity 4.
4972 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4973 63 - accum-offset = 256.
4974 15:6 10 bits Reserved, must be 0.
4976 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
4977 launched in the same CU.
4978 - If 1 the waves of a work-group can be
4979 launched in different CUs. The waves
4980 cannot use S_BARRIER or LDS.
4981 31:17 15 bits Reserved, must be 0.
4983 32 **Total size 4 bytes.**
4984 ======= ===================================================================================================================
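
For illustration, a minimal sketch of the ACCUM_OFFSET encoding described in
the table above; the helper name is hypothetical.

.. code-block:: c++

  #include <cassert>
  #include <cstdint>

  // Hypothetical helper: the ACCUM_OFFSET field holds (accum-offset / 4) - 1,
  // where accum-offset is the index of the first AccVGPR and is a multiple
  // of 4 in the range 4..256.
  uint32_t encodeAccumOffset(uint32_t AccumOffset) {
    assert(AccumOffset >= 4 && AccumOffset <= 256 && AccumOffset % 4 == 0);
    return AccumOffset / 4 - 1; // 4 -> 0, 8 -> 1, ..., 256 -> 63
  }
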
4988 .. table:: compute_pgm_rsrc3 for GFX10-GFX12
4989 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx12-table
4991 ======= ======= =============================== ===========================================================================
4992 Bits Size Field Name Description
4993 ======= ======= =============================== ===========================================================================
4994 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For
4995 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4996 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4997 not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4998 9:4 6 bits INST_PREF_SIZE GFX10
4999 Reserved, must be 0.
5001 Number of instruction bytes to prefetch, starting at the kernel's entry
5002 point instruction, before wavefront starts execution. The value is 0..63
5003 with a granularity of 128 bytes.
5004 10 1 bit TRAP_ON_START GFX10
5005 Reserved, must be 0.
5009 If 1, wavefront starts execution by trapping into the trap handler.
5011 CP is responsible for filling in the trap on start bit in
5012 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
5014 11 1 bit TRAP_ON_END GFX10
5015 Reserved, must be 0.
5019 If 1, wavefront execution terminates by trapping into the trap handler.
5021 CP is responsible for filling in the trap on end bit in
5022 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
5023 30:12 19 bits Reserved, must be 0.
5024 31 1 bit IMAGE_OP GFX10
5025 Reserved, must be 0.
5027 If 1, the kernel execution contains image instructions. If executed as
5028 part of a graphics pipeline, image read instructions will stall waiting
5029 for any necessary ``WAIT_SYNC`` fence to be performed in order to
5030 indicate that earlier pipeline stages have completed writing to the
5033 Not used for compute kernels that are not part of a graphics pipeline and
5035 32 **Total size 4 bytes.**
5036 ======= ===================================================================================================================
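
A hedged sketch of the SHARED_VGPR_COUNT constraint quoted in the table above;
the helper name is hypothetical.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical check of the SHARED_VGPR_COUNT rules described above.
  bool isValidSharedVgprCount(uint32_t Rsrc1Vgprs, uint32_t SharedVgprCount,
                              bool IsWave32) {
    if (IsWave32)
      return SharedVgprCount == 0; // must be 0 for wavefront size 32
    // Wavefront size 64: value 0-15, blocks of 8 shared VGPRs, and
    // (compute_pgm_rsrc1.vgprs + 1) * 4 + shared_vgpr_count * 8 <= 256.
    return SharedVgprCount <= 15 &&
           (Rsrc1Vgprs + 1) * 4 + SharedVgprCount * 8 <= 256;
  }
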
5040 .. table:: Floating Point Rounding Mode Enumeration Values
5041 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
5043 ====================================== ===== ==============================
5044 Enumeration Name Value Description
5045 ====================================== ===== ==============================
5046 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
5047 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
5048 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
5049 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
5050 ====================================== ===== ==============================
5053 .. table:: Extended FLT_ROUNDS Enumeration Values
5054 :name: amdgpu-rounding-mode-enumeration-values-table
5056 +------------------------+---------------+-------------------+--------------------+----------+
5057 | | F32 NEAR_EVEN | F32 PLUS_INFINITY | F32 MINUS_INFINITY | F32 ZERO |
5058 +------------------------+---------------+-------------------+--------------------+----------+
5059 | F64/F16 NEAR_EVEN | 1 | 11 | 14 | 17 |
5060 +------------------------+---------------+-------------------+--------------------+----------+
5061 | F64/F16 PLUS_INFINITY | 8 | 2 | 15 | 18 |
5062 +------------------------+---------------+-------------------+--------------------+----------+
5063 | F64/F16 MINUS_INFINITY | 9 | 12 | 3 | 19 |
5064 +------------------------+---------------+-------------------+--------------------+----------+
5065 | F64/F16 ZERO | 10 | 13 | 16 | 0 |
5066 +------------------------+---------------+-------------------+--------------------+----------+
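
The table can be read as a 4x4 lookup keyed by the two hardware rounding modes.
A minimal sketch follows, using the encodings from
:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`; the
function name is hypothetical.

.. code-block:: c++

  // Hypothetical lookup reproducing the extended FLT_ROUNDS table above.
  // Rounding mode encodings: 0 = NEAR_EVEN, 1 = PLUS_INFINITY,
  // 2 = MINUS_INFINITY, 3 = ZERO.
  int extendedFltRounds(unsigned F32Mode, unsigned F64F16Mode) {
    static const int Table[4][4] = {
        //                       F32: NEAR_EVEN  +INF  -INF  ZERO
        /* F64/F16 NEAR_EVEN      */ {1, 11, 14, 17},
        /* F64/F16 PLUS_INFINITY  */ {8, 2, 15, 18},
        /* F64/F16 MINUS_INFINITY */ {9, 12, 3, 19},
        /* F64/F16 ZERO           */ {10, 13, 16, 0},
    };
    return Table[F64F16Mode & 3][F32Mode & 3];
  }
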
5070 .. table:: Floating Point Denorm Mode Enumeration Values
5071 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
5073 ====================================== ===== ====================================
5074 Enumeration Name Value Description
5075 ====================================== ===== ====================================
5076 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination Denorms
5077 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
5078 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
5079 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
5080 ====================================== ===== ====================================
5082 Denormal flushing is sign respecting, i.e. the behavior expected by
5083 ``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
5084 ``"denormal-fp-math"="positive-zero"``.
5088 .. table:: System VGPR Work-Item ID Enumeration Values
5089 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
5091 ======================================== ===== ============================
5092 Enumeration Name Value Description
5093 ======================================== ===== ============================
5094 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
5096 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
5098 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
5100 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
5101 ======================================== ===== ============================
5103 .. _amdgpu-amdhsa-initial-kernel-execution-state:
5105 Initial Kernel Execution State
5106 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5108 This section defines the register state that will be set up by the packet
5109 processor prior to the start of execution of every wavefront. This is limited by
5110 the constraints of the hardware controllers of CP/ADC/SPI.
5112 The order of the SGPR registers is defined, but the compiler can specify which
5113 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
5114 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5115 for enabled registers are dense starting at SGPR0: the first enabled register is
5116 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have an SGPR number.
5119 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
5120 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
5121 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
5122 actually initialized. These are then immediately followed by the System SGPRs
5123 that are set up by ADC/SPI and can have different values for each wavefront of
5126 SGPR register initial state is defined in
5127 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
5129 .. table:: SGPR Register Set Up Order
5130 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
5132 ========== ========================== ====== ==============================
5133 SGPR Order Name Number Description
5134 (kernel descriptor enable of
5136 ========== ========================== ====== ==============================
5137 First Private Segment Buffer 4 See
5138 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5140 then Dispatch Ptr 2 64-bit address of AQL dispatch
5141 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
5143 then Queue Ptr 2 64-bit address of amd_queue_t
5144 (enable_sgpr_queue_ptr) object for AQL queue on which
5145 the dispatch packet was
5147 then Kernarg Segment Ptr 2 64-bit address of Kernarg
5148 (enable_sgpr_kernarg segment. This is directly
5149 _segment_ptr) copied from the
5150 kernarg_address in the kernel
5153 Having CP load it once avoids
5154 loading it at the beginning of
5156 then Dispatch Id 2 64-bit Dispatch ID of the
5157 (enable_sgpr_dispatch_id) dispatch packet being
5159 then Flat Scratch Init 2 See
5160 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5162 then Preloaded Kernargs N/A See
5163 (kernarg_preload_spec :ref:`amdgpu-amdhsa-kernarg-preload`.
5165 then Private Segment Size 1 The 32-bit byte size of a
5166 (enable_sgpr_private single work-item's memory
5167 _segment_size) allocation. This is the
5168 value from the kernel
5169 dispatch packet Private
5170 Segment Byte Size rounded up
5171 by CP to a multiple of
5174 Having CP load it once avoids
5175 loading it at the beginning of
5178 This is not used for
5179 GFX7-GFX8 since it is the same
5180 value as the second SGPR of
5181 Flat Scratch Init. However, it
5182 may be needed for GFX9-GFX11 which
5183 changes the meaning of the
5184 Flat Scratch Init value.
5185 then Work-Group Id X 1 32-bit work-group id in X
5186 (enable_sgpr_workgroup_id dimension of grid for
5188 then Work-Group Id Y 1 32-bit work-group id in Y
5189 (enable_sgpr_workgroup_id dimension of grid for
5191 then Work-Group Id Z 1 32-bit work-group id in Z
5192 (enable_sgpr_workgroup_id dimension of grid for
5194 then Work-Group Info 1 {first_wavefront, 14'b0000,
5195 (enable_sgpr_workgroup ordered_append_term[10:0],
5196 _info) threadgroup_size_in_wavefronts[5:0]}
5197 then Scratch Wavefront Offset 1 See
5198 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5199 _segment_wavefront_offset) and
5200 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5201 ========== ========================== ====== ==============================
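
For illustration, a possible decoding of the Work-Group Info SGPR whose layout
is quoted in the table above, assuming the ``{first_wavefront, 14'b0000,
ordered_append_term[10:0], threadgroup_size_in_wavefronts[5:0]}`` concatenation
is packed most-significant field first; the helper and field names are
hypothetical.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical decode of the Work-Group Info system SGPR.
  struct WorkGroupInfo {
    bool FirstWavefront;         // bit 31
    uint32_t OrderedAppendTerm;  // bits 16:6
    uint32_t WavefrontsPerGroup; // bits 5:0
  };

  WorkGroupInfo decodeWorkGroupInfo(uint32_t Sgpr) {
    return {(Sgpr >> 31) != 0, (Sgpr >> 6) & 0x7FF, Sgpr & 0x3F};
  }
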
5203 The order of the VGPR registers is defined, but the compiler can specify which
5204 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
5205 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5206 for enabled registers are dense starting at VGPR0: the first enabled register is
5207 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a VGPR number.
5210 There are different methods used for the VGPR initial state:
5212 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
5213 specifies otherwise, a separate VGPR register is used per work-item ID. The
5214 VGPR register initial state for this method is defined in
5215 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
5216 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5217 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
5218 for all work-item IDs. The register layout for this method is defined in
5219 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
5221 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
5222 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
5224 ========== ========================== ====== ==============================
5225 VGPR Order Name Number Description
5226 (kernel descriptor enable of
5228 ========== ========================== ====== ==============================
5229 First Work-Item Id X 1 32-bit work-item id in X
5230 (Always initialized) dimension of work-group for
5232 then Work-Item Id Y 1 32-bit work-item id in Y
5233 (enable_vgpr_workitem_id dimension of work-group for
5234 > 0) wavefront lane.
5235 then Work-Item Id Z 1 32-bit work-item id in Z
5236 (enable_vgpr_workitem_id dimension of work-group for
5237 > 1) wavefront lane.
5238 ========== ========================== ====== ==============================
5242 .. table:: Register Layout for Packed Work-Item ID Method
5243 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
5245 ======= ======= ================ =========================================
5246 Bits Size Field Name Description
5247 ======= ======= ================ =========================================
5248 0:9 10 bits Work-Item Id X Work-item id in X
5249 dimension of work-group for
5254 10:19 10 bits Work-Item Id Y Work-item id in Y
5255 dimension of work-group for
5258 Initialized if enable_vgpr_workitem_id >
5259 0, otherwise set to 0.
5260 20:29 10 bits Work-Item Id Z Work-item id in Z
5261 dimension of work-group for
5264 Initialized if enable_vgpr_workitem_id >
5265 1, otherwise set to 0.
5266 30:31 2 bits Reserved, set to 0.
5267 ======= ======= ================ =========================================
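
For illustration, a minimal sketch of unpacking the packed work-item ID layout
shown in the table above (VGPR0 on targets with packed work-item IDs); the
helper name is hypothetical.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical unpacking of the packed work-item ID VGPR.
  struct WorkItemId {
    uint32_t X, Y, Z;
  };

  WorkItemId unpackWorkItemId(uint32_t Vgpr0) {
    return {Vgpr0 & 0x3FF,          // bits 9:0
            (Vgpr0 >> 10) & 0x3FF,  // bits 19:10
            (Vgpr0 >> 20) & 0x3FF}; // bits 29:20
  }
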
5269 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
5271 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data registers.
5273 2. Work-group Id registers X, Y, Z are set by ADC which supports any
5274 combination including none.
5275 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
5276 its value cannot be included with the flat scratch init value which is per
5277 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
5278 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) or (X, Y, Z).
5280 5. Flat Scratch register pair initialization is described in
5281 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5283 The global segment can be accessed either using buffer instructions (GFX6 which
5284 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
5285 instructions (GFX9-GFX11).
5287 If buffer operations are used, then the compiler can generate a V# with the
5288 following properties:
5292 * ATC: 1 if IOMMU present (such as APU)
5294 * MTYPE set to support memory coherence that matches the runtime (such as CC for
5295 APU and NC for dGPU).
5297 .. _amdgpu-amdhsa-kernarg-preload:
5299 Preloaded Kernel Arguments
5300 ++++++++++++++++++++++++++
5302 On hardware that supports this feature, kernel arguments can be preloaded into
5303 User SGPRs, up to the maximum number of User SGPRs available. The allocation of
5304 Preload SGPRs occurs directly after the last enabled non-kernarg preload User
5305 SGPR. (See :ref:`amdgpu-amdhsa-initial-kernel-execution-state`)
5307 The data preloaded is copied from the kernarg segment; the amount of data is
5308 determined by the value specified in the kernarg_preload_spec_length field of
5309 the kernel descriptor. This data is then loaded into consecutive User SGPRs. The
5310 number of SGPRs receiving preloaded kernarg data corresponds with the value
5311 given by kernarg_preload_spec_length. The preloading starts at the dword offset
5312 within the kernarg segment, which is specified by the
5313 kernarg_preload_spec_offset field.
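
For illustration, a hedged model of the preload copy just described; all names
are hypothetical, and ``FirstPreloadSgpr`` stands for the user SGPR immediately
after the last enabled non-kernarg preload user SGPR.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical model: user SGPR FirstPreloadSgpr + i receives the kernarg
  // segment dword at kernarg_preload_spec_offset + i, for
  // i < kernarg_preload_spec_length.
  void modelKernargPreload(const uint32_t *KernargSegmentDwords,
                           uint32_t PreloadOffsetDwords,
                           uint32_t PreloadLengthDwords,
                           uint32_t FirstPreloadSgpr, uint32_t UserSgprs[16]) {
    for (uint32_t I = 0; I != PreloadLengthDwords; ++I)
      UserSgprs[FirstPreloadSgpr + I] =
          KernargSegmentDwords[PreloadOffsetDwords + I];
  }
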
5315 If the kernarg_preload_spec_length is non-zero, the CP firmware will append an
5316 additional 256 bytes to the kernel_code_entry_byte_offset. This addition
5317 facilitates the incorporation of a prologue to the kernel entry to handle cases
5318 where code designed for kernarg preloading is executed on hardware equipped with
5319 incompatible firmware. If hardware has compatible firmware the 256 bytes at the
5320 start of the kernel entry will be skipped.
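
A minimal sketch of the resulting entry-point selection, assuming a flag that
indicates whether the firmware supports kernarg preloading; the names are
hypothetical.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical model: firmware that supports kernarg preloading skips the
  // 256-byte compatibility prologue at the start of the kernel entry.
  uint64_t effectiveEntryByteOffset(uint64_t KernelCodeEntryByteOffset,
                                    uint32_t KernargPreloadSpecLength,
                                    bool FirmwareSupportsPreload) {
    if (KernargPreloadSpecLength != 0 && FirmwareSupportsPreload)
      return KernelCodeEntryByteOffset + 256;
    return KernelCodeEntryByteOffset;
  }
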
5322 .. _amdgpu-amdhsa-kernel-prolog:
5327 The compiler performs initialization in the kernel prologue depending on the
5328 target and information about things like stack usage in the kernel and called
5329 functions. Some of this initialization requires the compiler to request certain
5330 User and System SGPRs be present in the
5331 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
5332 :ref:`amdgpu-amdhsa-kernel-descriptor`.
5334 .. _amdgpu-amdhsa-kernel-prolog-cfi:
5339 1. The CFI return address is undefined.
5341 2. The CFI CFA is defined using an expression which evaluates to a location
5342 description that comprises one memory location description for the
5343 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
5345 .. _amdgpu-amdhsa-kernel-prolog-m0:
5351 The M0 register must be initialized with a value that is at least the total LDS size
5352 if the kernel may access LDS via DS or flat operations. The total LDS size is
5353 available in the dispatch packet. For M0, it is also possible to use the maximum
5354 possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
5357 The M0 register is not used for range checking LDS accesses and so does not
5358 need to be initialized in the prolog.
5360 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
5365 If the kernel has function calls it must set up the ABI stack pointer described
5366 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
5367 SGPR32 to the unswizzled scratch offset of the address past the last local allocation.
5370 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
5375 If the kernel needs a frame pointer for the reasons defined in
5376 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
5377 kernel prolog. If a frame pointer is not required then all uses of the frame
5378 pointer are replaced with immediate ``0`` offsets.
5380 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
5385 There are different methods used for initializing flat scratch:
5387 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5388 specifies *Does not support generic address space*:
5390 Flat scratch is not supported and there is no flat scratch register pair.
5392 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5393 specifies *Offset flat scratch*:
5395 If the kernel or any function it calls may use flat operations to access
5396 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5397 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
5398 Scratch Wavefront Offset SGPR registers (see
5399 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5401 1. The low word of Flat Scratch Init is the 32-bit byte offset from
5402 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
5403 being managed by SPI for the queue executing the kernel dispatch. This is
5404 the same value used in the Scratch Segment Buffer V# base address.
5406 CP obtains this from the runtime. (The Scratch Segment Buffer base address
5407 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
5409 The prolog must add the value of Scratch Wavefront Offset to get the
5410 wavefront's byte scratch backing memory offset from
5411 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
5413 The Scratch Wavefront Offset must also be used as an offset with Private
5414 segment address when using the Scratch Segment Buffer.
5416 Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
5417 shifted by 8 before moving into FLAT_SCRATCH_HI.
5419 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
5420 SGPRn is the highest numbered SGPR allocated to the wavefront).
5421 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
5422 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
5423 FLAT SCRATCH BASE in flat memory instructions that access the scratch
5425 2. The second word of Flat Scratch Init is the 32-bit byte size of a single
5426 work-item's scratch memory usage.
5428 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
5429 checks that the value in the kernel dispatch packet Private Segment Byte
5430 Size is not larger and requests the runtime to increase the queue's scratch
5433 CP directly loads from the kernel dispatch packet Private Segment Byte Size
5434 field and rounds up to a multiple of DWORD. Having CP load it once avoids
5435 loading it at the beginning of every wavefront.
5437 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
5438 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
5439 in flat memory instructions. A sketch of this setup appears after this list.
5441 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5442 specifies *Absolute flat scratch*:
5444 If the kernel or any function it calls may use flat operations to access
5445 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5446 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
5447 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
5448 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5450 The Flat Scratch Init is the 64-bit address of the base of scratch backing
5451 memory being managed by SPI for the queue executing the kernel dispatch.
5453 CP obtains this from the runtime.
5455 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
5456 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
5457 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
5458 memory instructions.
5460 The Scratch Wavefront Offset must also be used as an offset with Private
5461 segment address when using the Scratch Segment Buffer (see
5462 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
5464 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5465 specifies *Architected flat scratch*:
5467 If ENABLE_PRIVATE_SEGMENT is enabled in
5468 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table` then the FLAT_SCRATCH
5469 register pair will be initialized to the 64-bit address of the base of scratch
5470 backing memory being managed by SPI for the queue executing the kernel
5471 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
5472 flat scratch base in flat memory instructions.
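
The following sketch models the *Offset flat scratch* setup described above
(GFX7-GFX8). It is illustrative only: the helper and parameter names are
hypothetical, and the actual prolog is emitted as scalar ALU instructions.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical model of the *Offset flat scratch* prolog: the wave's byte
  // offset is converted to 256-byte units for FLAT_SCRATCH_HI, and the
  // per-work-item size becomes FLAT_SCRATCH_LO.
  struct FlatScratch {
    uint32_t Lo, Hi;
  };

  FlatScratch setUpOffsetFlatScratch(uint32_t FlatScratchInitLo, // byte offset
                                     uint32_t FlatScratchInitHi, // size per work-item
                                     uint32_t ScratchWavefrontOffset) {
    uint32_t WaveByteOffset = FlatScratchInitLo + ScratchWavefrontOffset;
    return {FlatScratchInitHi,    // FLAT SCRATCH SIZE
            WaveByteOffset >> 8}; // FLAT SCRATCH BASE offset, 256-byte units
  }
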
5474 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
5476 Private Segment Buffer
5477 ++++++++++++++++++++++
5479 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
5480 *Architected flat scratch* then a Private Segment Buffer is not supported.
5481 Instead the flat SCRATCH instructions are used.
5483 Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
5484 that are used as a V# to access scratch. CP uses the value provided by the
5485 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
5486 access the private memory space using a segment address. See
5487 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
5489 The scratch V# is a four-aligned SGPR and always selected for the kernel as follows:
5492 - If it is known during instruction selection that there is stack usage,
5493 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
5494 optimizations are disabled (``-O0``), if stack objects already exist (for
5495 locals, etc.), or if there are any function calls.
5497 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5498 are reserved for the tentative scratch V#. These will be used if it is
5499 determined that spilling is needed.
5501 - If no use is made of the tentative scratch V#, then it is unreserved,
5502 and the register count is determined ignoring it.
5503 - If use is made of the tentative scratch V#, then its register numbers
5504 are shifted to the first four-aligned SGPR index after the highest one
5505 allocated by the register allocator, and all uses are updated. The
5506 register count includes them in the shifted location.
5507 - In either case, if the processor has the SGPR allocation bug, the
5508 tentative allocation is not shifted or unreserved in order to ensure
5509 the register count is higher to work around the bug.
5513 This approach of using a tentative scratch V# and shifting the register
5514 numbers if used avoids having to perform register allocation a second
5515 time if the tentative V# is eliminated. This is more efficient and
5516 avoids the problem that the second register allocation may perform
5517 spilling which will fail as there is no longer a scratch V#.
5519 When the kernel prolog code is being emitted it is known whether the scratch V#
5520 described above is actually used. If it is, the prolog code must set it up by
5521 copying the Private Segment Buffer to the scratch V# registers and then adding
5522 the Private Segment Wavefront Offset to the queue base address in the V#. The
5523 result is a V# with a base address pointing to the beginning of the wavefront
5524 scratch backing memory.
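
For illustration, a hedged sketch of this V# fix-up, assuming the usual buffer
descriptor layout with the 48-bit base address held in word 0 and the low 16
bits of word 1; the helper name is hypothetical.

.. code-block:: c++

  #include <cstdint>

  // Hypothetical model: add the Private Segment Wavefront Offset to the
  // 48-bit base address in the scratch V#.
  void addWavefrontOffsetToScratchVsharp(uint32_t Vsharp[4],
                                         uint32_t ScratchWavefrontOffset) {
    uint64_t Base = (uint64_t(Vsharp[1] & 0xFFFF) << 32) | Vsharp[0];
    Base += ScratchWavefrontOffset;
    Vsharp[0] = uint32_t(Base);
    Vsharp[1] = (Vsharp[1] & 0xFFFF0000u) | uint32_t((Base >> 32) & 0xFFFF);
  }
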
5526 The Private Segment Buffer is always requested, but the Private Segment
5527 Wavefront Offset is only requested if it is used (see
5528 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5530 .. _amdgpu-amdhsa-memory-model:
5535 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5536 code (see :ref:`memmodel`).
5538 The AMDGPU backend supports the memory synchronization scopes specified in
5539 :ref:`amdgpu-memory-scopes`.
5541 The code sequences used to implement the memory model specify the order of
5542 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5543 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5544 to other memory instructions executed by the same thread. This allows them to be
5545 moved earlier or later which can allow them to be combined with other instances
5546 of the same instruction, or hoisted/sunk out of loops to improve performance.
5547 Only the instructions related to the memory model are given; additional
5548 ``s_waitcnt`` instructions are required to ensure registers are defined before
5549 being used. These may be able to be combined with the memory model ``s_waitcnt``
5550 instructions as described above.
5552 The AMDGPU backend supports the following memory models:
5554 HSA Memory Model [HSA]_
5555 The HSA memory model uses a single happens-before relation for all address
5556 spaces (see :ref:`amdgpu-address-spaces`).
5557 OpenCL Memory Model [OpenCL]_
5558 The OpenCL memory model which has separate happens-before relations for the
5559 global and local address spaces. Only a fence specifying both global and
5560 local address space, and seq_cst instructions join the relationships. Since
5561 the LLVM ``fence`` instruction does not allow an address space to be
5562 specified, the OpenCL fence has to conservatively assume both local and
5563 global address space was specified. However, optimizations can often be
5564 done to eliminate the additional ``s_waitcnt`` instructions when there are
5565 no intervening memory instructions which access the corresponding address
5566 space. The code sequences in the table indicate what can be omitted for the
5567 OpenCL memory model. The target triple environment is used to determine if the
5568 source language is OpenCL (see :ref:`amdgpu-opencl`).
5570 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS operations.
5573 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5574 termed vector memory operations.
5576 Private address space uses ``buffer_load/store`` using the scratch V#
5577 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5578 is accessing the memory, atomic memory orderings are not meaningful, and all
5579 accesses are treated as non-atomic.
5581 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5582 scalar memory instructions). Since the constant address space contents do not
5583 change during the execution of a kernel dispatch it is not legal to perform
5584 stores, and atomic memory orderings are not meaningful, and all accesses are
5585 treated as non-atomic.
5587 A memory synchronization scope wider than work-group is not meaningful for the
5588 group (LDS) address space and is treated as work-group.
5590 The memory model does not support the region address space which is treated as
5593 Acquire memory ordering is not meaningful on store atomic instructions and is
5594 treated as non-atomic.
5596 Release memory ordering is not meaningful on load atomic instructions and is
5597 treated as non-atomic.
5599 Acquire-release memory ordering is not meaningful on load or store atomic
5600 instructions and is treated as acquire and release respectively.
5602 The memory order also adds the single thread optimization constraints defined in
5604 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5606 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5607 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5609 ============ ==============================================================
5610 LLVM Memory Optimization Constraints
5612 ============ ==============================================================
5615 acquire - If a load atomic/atomicrmw then no following load/load
5616 atomic/store/store atomic/atomicrmw/fence instruction can be
5617 moved before the acquire.
5618 - If a fence then same as load atomic, plus no preceding
5619 associated fence-paired-atomic can be moved after the fence.
5620 release - If a store atomic/atomicrmw then no preceding load/load
5621 atomic/store/store atomic/atomicrmw/fence instruction can be
5622 moved after the release.
5623 - If a fence then same as store atomic, plus no following
5624 associated fence-paired-atomic can be moved before the
5626 acq_rel Same constraints as both acquire and release.
5627 seq_cst - If a load atomic then same constraints as acquire, plus no
5628 preceding sequentially consistent load atomic/store
5629 atomic/atomicrmw/fence instruction can be moved after the
5631 - If a store atomic then the same constraints as release, plus
5632 no following sequentially consistent load atomic/store
5633 atomic/atomicrmw/fence instruction can be moved before the
5635 - If an atomicrmw/fence then same constraints as acq_rel.
5636 ============ ==============================================================
5638 The code sequences used to implement the memory model are defined in the following sections:
5641 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5642 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5643 * :ref:`amdgpu-amdhsa-memory-model-gfx942`
5644 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5646 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5648 Memory Model GFX6-GFX9
5649 ++++++++++++++++++++++
5653 * Each agent has multiple shader arrays (SA).
5654 * Each SA has multiple compute units (CU).
5655 * Each CU has multiple SIMDs that execute wavefronts.
5656 * The wavefronts for a single work-group are executed in the same CU but may be
5657 executed by different SIMDs.
5658 * Each CU has a single LDS memory shared by the wavefronts of the work-groups executing on it.
5660 * All LDS operations of a CU are performed as wavefront wide operations in a
5661 global order and involve no caching. Completion is reported to a wavefront in execution order.
5663 * The LDS memory has multiple request queues shared by the SIMDs of a
5664 CU. Therefore, the LDS operations performed by different wavefronts of a
5665 work-group can be reordered relative to each other, which can result in
5666 reordering the visibility of vector memory operations with respect to LDS
5667 operations of other wavefronts in the same work-group. A ``s_waitcnt
5668 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5669 vector memory operations between wavefronts of a work-group, but not between
5670 operations performed by the same wavefront.
5671 * The vector memory operations are performed as wavefront wide operations and
5672 completion is reported to a wavefront in execution order. The exception is
5673 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5674 vector memory order if they access LDS memory, and out of LDS operation order
5675 if they access global memory.
5676 * The vector memory operations access a single vector L1 cache shared by all
5677 SIMDs of a CU. Therefore, no special action is required for coherence between the
5678 lanes of a single wavefront, or for coherence between wavefronts in the same
5679 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
5680 wavefronts executing in different work-groups as they may be executing on different CUs.
5682 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5683 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5684 scalar operations are used in a restricted way so do not impact the memory
5685 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5686 * The vector and scalar memory operations use an L2 cache shared by all CUs on
5688 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
5690 * Each CU has a separate request queue per channel. Therefore, the vector and
5691 scalar memory operations performed by wavefronts executing in different
5692 work-groups (which may be executing on different CUs) of an agent can be
5693 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5694 ensure synchronization between vector memory operations of different CUs. It
5695 ensures a previous vector memory operation has completed before executing a
5696 subsequent vector memory or LDS operation and so can be used to meet the
5697 requirements of acquire and release.
5698 * The L2 cache can be kept coherent with other agents on some targets, or ranges
5699 of virtual addresses can be set up to bypass it to ensure system coherence.
5701 Scalar memory operations are only used to access memory that is proven to not
5702 change during the execution of the kernel dispatch. This includes constant
5703 address space and global address space for program scope ``const`` variables.
5704 Therefore, the kernel machine code does not have to maintain the scalar cache to
5705 ensure it is coherent with the vector caches. The scalar and vector caches are
5706 invalidated between kernel dispatches by CP since constant address space data
5707 may change between kernel dispatch executions. See
5708 :ref:`amdgpu-amdhsa-memory-spaces`.
5710 The one exception is if scalar writes are used to spill SGPR registers. In this
5711 case the AMDGPU backend ensures the memory location used to spill is never
5712 accessed by vector memory operations at the same time. If scalar writes are used
5713 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5714 return since the locations may be used for vector memory instructions by a
5715 future wavefront that uses the same scratch area, or a function call that
5716 creates a frame at the same address, respectively. There is no need for a
5717 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5719 For kernarg backing memory:
5721 * CP invalidates the L1 cache at the start of each kernel dispatch.
5722 * On dGPU the kernarg backing memory is allocated in host memory accessed as
5723 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5724 causes it to be treated as non-volatile and so is not invalidated by
5726 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
5727 and so the L2 cache will be coherent with the CPU and other agents.
5729 Scratch backing memory (which is used for the private address space) is accessed
5730 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5731 only accessed by a single thread, and is always write-before-read, there is
5732 never a need to invalidate these entries from the L1 cache. Hence all cache
5733 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5735 The code sequences used to implement the memory model for GFX6-GFX9 are defined
5736 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5738 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5739 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5741 ============ ============ ============== ========== ================================
5742 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
5743 Ordering Sync Scope Address GFX6-GFX9
5745 ============ ============ ============== ========== ================================
5747 ------------------------------------------------------------------------------------
5748 load *none* *none* - global - !volatile & !nontemporal
5750 - private 1. buffer/global/flat_load
5752 - !volatile & nontemporal
5754 1. buffer/global/flat_load
5759 1. buffer/global/flat_load
5761 2. s_waitcnt vmcnt(0)
5763 - Must happen before
5764 any following volatile
5775 load *none* *none* - local 1. ds_load
5776 store *none* *none* - global - !volatile & !nontemporal
5778 - private 1. buffer/global/flat_store
5780 - !volatile & nontemporal
5782 1. buffer/global/flat_store
5787 1. buffer/global/flat_store
5788 2. s_waitcnt vmcnt(0)
5790 - Must happen before
5791 any following volatile
5802 store *none* *none* - local 1. ds_store
5803 **Unordered Atomic**
5804 ------------------------------------------------------------------------------------
5805 load atomic unordered *any* *any* *Same as non-atomic*.
5806 store atomic unordered *any* *any* *Same as non-atomic*.
5807 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
5808 **Monotonic Atomic**
5809 ------------------------------------------------------------------------------------
5810 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
5812 - workgroup - generic
5813 load atomic monotonic - agent - global 1. buffer/global/flat_load
5814 - system - generic glc=1
5815 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
5816 - wavefront - generic
5820 store atomic monotonic - singlethread - local 1. ds_store
5823 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
5824 - wavefront - generic
5828 atomicrmw monotonic - singlethread - local 1. ds_atomic
5832 ------------------------------------------------------------------------------------
5833 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
5836 load atomic acquire - workgroup - global 1. buffer/global_load
5837 load atomic acquire - workgroup - local 1. ds/flat_load
5838 - generic 2. s_waitcnt lgkmcnt(0)
5841 - Must happen before
5850 older than a local load
5854 load atomic acquire - agent - global 1. buffer/global_load
5856 2. s_waitcnt vmcnt(0)
5858 - Must happen before
5866 3. buffer_wbinvl1_vol
5868 - Must happen before
5878 load atomic acquire - agent - generic 1. flat_load glc=1
5879 - system 2. s_waitcnt vmcnt(0) &
5884 - Must happen before
5887 - Ensures the flat_load
5892 3. buffer_wbinvl1_vol
5894 - Must happen before
5904 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
5907 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
5908 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
5909 - generic 2. s_waitcnt lgkmcnt(0)
5912 - Must happen before
5925 atomicrmw acquire - agent - global 1. buffer/global_atomic
5926 - system 2. s_waitcnt vmcnt(0)
5928 - Must happen before
5937 3. buffer_wbinvl1_vol
5939 - Must happen before
5949 atomicrmw acquire - agent - generic 1. flat_atomic
5950 - system 2. s_waitcnt vmcnt(0) &
5955 - Must happen before
5964 3. buffer_wbinvl1_vol
5966 - Must happen before
5976 fence acquire - singlethread *none* *none*
5978 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5983 - However, since LLVM
6008 fence-paired-atomic).
6009 - Must happen before
6020 fence-paired-atomic.
6022 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
6029 - However, since LLVM
6037 - Could be split into
6046 - s_waitcnt vmcnt(0)
6057 fence-paired-atomic).
6058 - s_waitcnt lgkmcnt(0)
6069 fence-paired-atomic).
6070 - Must happen before
6084 fence-paired-atomic.
6086 2. buffer_wbinvl1_vol
6088 - Must happen before any
6089 following global/generic
6099 ------------------------------------------------------------------------------------
6100 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
6103 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6112 - Must happen before
6123 2. buffer/global/flat_store
6124 store atomic release - workgroup - local 1. ds_store
6125 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
6126 - system - generic vmcnt(0)
6132 - Could be split into
6141 - s_waitcnt vmcnt(0)
6148 - s_waitcnt lgkmcnt(0)
6155 - Must happen before
6166 2. buffer/global/flat_store
6167 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
6170 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6179 - Must happen before
6190 2. buffer/global/flat_atomic
6191 atomicrmw release - workgroup - local 1. ds_atomic
6192 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
6193 - system - generic vmcnt(0)
6197 - Could be split into
6206 - s_waitcnt vmcnt(0)
6213 - s_waitcnt lgkmcnt(0)
6220 - Must happen before
6231 2. buffer/global/flat_atomic
6232 fence release - singlethread *none* *none*
6234 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6239 - However, since LLVM
6260 - Must happen before
6269 fence-paired-atomic).
6276 fence-paired-atomic.
6278 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
6289 - However, since LLVM
6304 - Could be split into
6313 - s_waitcnt vmcnt(0)
6320 - s_waitcnt lgkmcnt(0)
6327 - Must happen before
6336 fence-paired-atomic).
6343 fence-paired-atomic.
6345 **Acquire-Release Atomic**
6346 ------------------------------------------------------------------------------------
6347 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
6350 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
6359 - Must happen before
6370 2. buffer/global_atomic
6372 atomicrmw acq_rel - workgroup - local 1. ds_atomic
6373 2. s_waitcnt lgkmcnt(0)
6376 - Must happen before
6385 older than the local load
6389 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
6398 - Must happen before
6410 3. s_waitcnt lgkmcnt(0)
6413 - Must happen before
6422 older than a local load
6426 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
6431 - Could be split into
6440 - s_waitcnt vmcnt(0)
6447 - s_waitcnt lgkmcnt(0)
6454 - Must happen before
6465 2. buffer/global_atomic
6466 3. s_waitcnt vmcnt(0)
6468 - Must happen before
6477 4. buffer_wbinvl1_vol
6479 - Must happen before
6489 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
6494 - Could be split into
6503 - s_waitcnt vmcnt(0)
6510 - s_waitcnt lgkmcnt(0)
6517 - Must happen before
6529 3. s_waitcnt vmcnt(0) &
6534 - Must happen before
6543 4. buffer_wbinvl1_vol
6545 - Must happen before
6555 fence acq_rel - singlethread *none* *none*
6557 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6577 - Must happen before
6600 acquire-fence-paired-atomic)
6621 release-fence-paired-atomic).
6626 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
6633 - However, since LLVM
6641 - Could be split into
6650 - s_waitcnt vmcnt(0)
6657 - s_waitcnt lgkmcnt(0)
6664 - Must happen before
6669 global/local/generic
6678 acquire-fence-paired-atomic)
6690 global/local/generic
6699 release-fence-paired-atomic).
6704 2. buffer_wbinvl1_vol
6706 - Must happen before
6720 **Sequential Consistent Atomic**
6721 ------------------------------------------------------------------------------------
6722 load atomic seq_cst - singlethread - global *Same as corresponding
6723 - wavefront - local load atomic acquire,
6724 - generic except must generate
6725 all instructions even
6727 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
6743 lgkmcnt(0) and so do
6775 order. The s_waitcnt
6776 could be placed after
6780 make the s_waitcnt be
6787 instructions same as
6790 except must generate
6791 all instructions even
6793 load atomic seq_cst - workgroup - local *Same as corresponding
6794 load atomic acquire,
6795 except must generate
6796 all instructions even
6799 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
6800 - system - generic vmcnt(0)
6802 - Could be split into
6811 - s_waitcnt lgkmcnt(0)
6824 lgkmcnt(0) and so do
6827 - s_waitcnt vmcnt(0)
6872 order. The s_waitcnt
6873 could be placed after
6877 make the s_waitcnt be
6884 instructions same as
6887 except must generate
6888 all instructions even
6890 store atomic seq_cst - singlethread - global *Same as corresponding
6891 - wavefront - local store atomic release,
6892 - workgroup - generic except must generate
6893 - agent all instructions even
6894 - system for OpenCL.*
6895 atomicrmw seq_cst - singlethread - global *Same as corresponding
6896 - wavefront - local atomicrmw acq_rel,
6897 - workgroup - generic except must generate
6898 - agent all instructions even
6899 - system for OpenCL.*
6900 fence seq_cst - singlethread *none* *Same as corresponding
6901 - wavefront fence acq_rel,
6902 - workgroup except must generate
6903 - agent all instructions even
6904 - system for OpenCL.*
6905 ============ ============ ============== ========== ================================
6907 .. _amdgpu-amdhsa-memory-model-gfx90a:
6914 * Each agent has multiple shader arrays (SA).
6915 * Each SA has multiple compute units (CU).
6916 * Each CU has multiple SIMDs that execute wavefronts.
6917 * The wavefronts for a single work-group are executed in the same CU but may be
6918 executed by different SIMDs. The exception is when in tgsplit execution mode
6919 when the wavefronts may be executed by different SIMDs in different CUs.
6920 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6921 executing on it. The exception is when in tgsplit execution mode when no LDS
6922 is allocated as wavefronts of the same work-group can be in different CUs.
6923 * All LDS operations of a CU are performed as wavefront wide operations in a
6924 global order and involve no caching. Completion is reported to a wavefront in execution order.
6926 * The LDS memory has multiple request queues shared by the SIMDs of a
6927 CU. Therefore, the LDS operations performed by different wavefronts of a
6928 work-group can be reordered relative to each other, which can result in
6929 reordering the visibility of vector memory operations with respect to LDS
6930 operations of other wavefronts in the same work-group. A ``s_waitcnt
6931 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6932 vector memory operations between wavefronts of a work-group, but not between
6933 operations performed by the same wavefront.
6934 * The vector memory operations are performed as wavefront wide operations and
6935 completion is reported to a wavefront in execution order. The exception is
6936 that ``flat_load/store/atomic`` instructions can report out of vector memory
6937 order if they access LDS memory, and out of LDS operation order if they access
6939 * The vector memory operations access a single vector L1 cache shared by all
6940 SIMDs of a CU. Therefore:
6942 * No special action is required for coherence between the lanes of a single
6945 * No special action is required for coherence between wavefronts in the same
6946 work-group since they execute on the same CU. The exception is when in
6947 tgsplit execution mode as wavefronts of the same work-group can be in
6948 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6951 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6952 executing in different work-groups as they may be executing on different CUs.
6955 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6956 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6957 scalar operations are used in a restricted way so do not impact the memory
6958 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6959 * The vector and scalar memory operations use an L2 cache shared by all CUs on
6962 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
6964 * Each CU has a separate request queue per channel. Therefore, the vector and
6965 scalar memory operations performed by wavefronts executing in different
6966 work-groups (which may be executing on different CUs), or the same
6967 work-group if executing in tgsplit mode, of an agent can be reordered
6968 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6969 synchronization between vector memory operations of different CUs. It
6970 ensures a previous vector memory operation has completed before executing a
6971 subsequent vector memory or LDS operation and so can be used to meet the
6972 requirements of acquire and release.
6973 * The L2 cache of one agent can be kept coherent with other agents by:
6974 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6975 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6976 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6978 * Any local memory cache lines will be automatically invalidated by writes
6979 from CUs associated with other L2 caches, or writes from the CPU, due to
6980 the cache probe caused by coherent requests. Coherent requests are caused
6981 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6982 XGMI, and by PCIe requests that are configured to be coherent requests.
6983 * XGMI accesses from the CPU to local memory may be cached on the CPU.
6984 Subsequent access from the GPU will automatically invalidate or writeback
6985 the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6986 * Since all work-groups on the same agent share the same L2, no L2
6987 invalidation or writeback is required for coherence.
6988 * To ensure coherence of local and remote memory writes of work-groups in
6989 different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6990 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6991 (used for remote coarse grain memory). Note that MTYPE CC (used for local
6992 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6993 remote fine grain memory) bypasses the L2, so both will never result in
6994 dirty L2 cache lines.
6995 * To ensure coherence of local and remote memory reads of work-groups in
6996 different agents a ``buffer_invl2`` is required. It will invalidate L2
6997 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6998 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6999 coarse memory) cause local reads to be invalidated by remote writes with
7000 the PTE C-bit so these cache lines are not invalidated. Note that
7001 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
7002 never result in L2 cache lines that need to be invalidated.
7004 * PCIe access from the GPU to the CPU memory is kept coherent by using the
7005 MTYPE UC (uncached) which bypasses the L2.
7007 Scalar memory operations are only used to access memory that is proven to not
7008 change during the execution of the kernel dispatch. This includes constant
7009 address space and global address space for program scope ``const`` variables.
7010 Therefore, the kernel machine code does not have to maintain the scalar cache to
7011 ensure it is coherent with the vector caches. The scalar and vector caches are
7012 invalidated between kernel dispatches by CP since constant address space data
7013 may change between kernel dispatch executions. See
7014 :ref:`amdgpu-amdhsa-memory-spaces`.
7016 The one exception is if scalar writes are used to spill SGPR registers. In this
7017 case the AMDGPU backend ensures the memory location used to spill is never
7018 accessed by vector memory operations at the same time. If scalar writes are used
7019 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
7020 return since the locations may be used for vector memory instructions by a
7021 future wavefront that uses the same scratch area, or a function call that
7022 creates a frame at the same address, respectively. There is no need for a
7023 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
7025 For kernarg backing memory:
7027 * CP invalidates the L1 cache at the start of each kernel dispatch.
7028 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
7029 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
7030 cache. This also causes it to be treated as non-volatile and so is not
7031 invalidated by ``*_vol``.
7032 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
7033 so the L2 cache will be coherent with the CPU and other agents.
7035 Scratch backing memory (which is used for the private address space) is accessed
7036 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
7037 only accessed by a single thread, and is always write-before-read, there is
7038 never a need to invalidate these entries from the L1 cache. Hence all cache
7039 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
7041 The code sequences used to implement the memory model for GFX90A are defined
7042 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
7044 .. table:: AMDHSA Memory Model Code Sequences GFX90A
7045 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
7047 ============ ============ ============== ========== ================================
7048 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
7049 Ordering Sync Scope Address GFX90A
7051 ============ ============ ============== ========== ================================
7053 ------------------------------------------------------------------------------------
7054 load *none* *none* - global - !volatile & !nontemporal
7056 - private 1. buffer/global/flat_load
7058 - !volatile & nontemporal
7060 1. buffer/global/flat_load
7065 1. buffer/global/flat_load
7067 2. s_waitcnt vmcnt(0)
7069 - Must happen before
7070 any following volatile
7081 load *none* *none* - local 1. ds_load
7082 store *none* *none* - global - !volatile & !nontemporal
7084 - private 1. buffer/global/flat_store
7086 - !volatile & nontemporal
7088 1. buffer/global/flat_store
7093 1. buffer/global/flat_store
7094 2. s_waitcnt vmcnt(0)
7096 - Must happen before
7097 any following volatile
7108 store *none* *none* - local 1. ds_store
7109 **Unordered Atomic**
7110 ------------------------------------------------------------------------------------
7111 load atomic unordered *any* *any* *Same as non-atomic*.
7112 store atomic unordered *any* *any* *Same as non-atomic*.
7113 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
7114 **Monotonic Atomic**
7115 ------------------------------------------------------------------------------------
7116 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
7117 - wavefront - generic
7118 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
7121 - If not TgSplit execution
7124 load atomic monotonic - singlethread - local *If TgSplit execution mode,
7125 - wavefront local address space cannot
7126 - workgroup be used.*
7129 load atomic monotonic - agent - global 1. buffer/global/flat_load
7131 load atomic monotonic - system - global 1. buffer/global/flat_load
7133 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
7134 - wavefront - generic
7137 store atomic monotonic - system - global 1. buffer/global/flat_store
7139 store atomic monotonic - singlethread - local *If TgSplit execution mode,
7140 - wavefront local address space cannot
7141 - workgroup be used.*
7144 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
7145 - wavefront - generic
7148 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
7150 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
7151 - wavefront local address space cannot
7152 - workgroup be used.*
7156 ------------------------------------------------------------------------------------
7157 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
7160 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
7162 - If not TgSplit execution
7165 2. s_waitcnt vmcnt(0)
7167 - If not TgSplit execution
7169 - Must happen before the
7170 following buffer_wbinvl1_vol.
7172 3. buffer_wbinvl1_vol
7174 - If not TgSplit execution
7176 - Must happen before
7187 load atomic acquire - workgroup - local *If TgSplit execution mode,
7188 local address space cannot
7192 2. s_waitcnt lgkmcnt(0)
7195 - Must happen before
7204 older than the local load
7208 load atomic acquire - workgroup - generic 1. flat_load glc=1
7210 - If not TgSplit execution
7213 2. s_waitcnt lgkm/vmcnt(0)
7215 - Use lgkmcnt(0) if not
7216 TgSplit execution mode
7217 and vmcnt(0) if TgSplit
7219 - If OpenCL, omit lgkmcnt(0).
7220 - Must happen before
7222 buffer_wbinvl1_vol and any
7223 following global/generic
7230 older than a local load
7234 3. buffer_wbinvl1_vol
7236 - If not TgSplit execution
7243 load atomic acquire - agent - global 1. buffer/global_load
7245 2. s_waitcnt vmcnt(0)
7247 - Must happen before
7255 3. buffer_wbinvl1_vol
7257 - Must happen before
7267 load atomic acquire - system - global 1. buffer/global/flat_load
7269 2. s_waitcnt vmcnt(0)
7271 - Must happen before
7272 following buffer_invl2 and
7282 - Must happen before
7290 stale L1 global data,
7291 nor see stale L2 MTYPE
7293 MTYPE RW and CC memory will
7294 never be stale in L2 due to
7297 load atomic acquire - agent - generic 1. flat_load glc=1
7298 2. s_waitcnt vmcnt(0) &
7301 - If TgSplit execution mode,
7305 - Must happen before
7308 - Ensures the flat_load
7313 3. buffer_wbinvl1_vol
7315 - Must happen before
7325 load atomic acquire - system - generic 1. flat_load glc=1
7326 2. s_waitcnt vmcnt(0) &
7329 - If TgSplit execution mode,
7333 - Must happen before
7337 - Ensures the flat_load
7345 - Must happen before
7353 stale L1 global data,
7354 nor see stale L2 MTYPE
7356 MTYPE RW and CC memory will
7357 never be stale in L2 due to
7360 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
7361 - wavefront - generic
7362 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
7363 - wavefront local address space cannot
7367 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
7368 2. s_waitcnt vmcnt(0)
7370 - If not TgSplit execution
7372 - Must happen before the
7373 following buffer_wbinvl1_vol.
7374 - Ensures the atomicrmw
7379 3. buffer_wbinvl1_vol
7381 - If not TgSplit execution
7383 - Must happen before
7393 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
7394 local address space cannot
7398 2. s_waitcnt lgkmcnt(0)
7401 - Must happen before
7410 older than the local
7414 atomicrmw acquire - workgroup - generic 1. flat_atomic
7415 2. s_waitcnt lgkm/vmcnt(0)
7417 - Use lgkmcnt(0) if not
7418 TgSplit execution mode
7419 and vmcnt(0) if TgSplit
7421 - If OpenCL, omit lgkmcnt(0).
7422 - Must happen before
7424 buffer_wbinvl1_vol and
7437 3. buffer_wbinvl1_vol
7439 - If not TgSplit execution
7446 atomicrmw acquire - agent - global 1. buffer/global_atomic
7447 2. s_waitcnt vmcnt(0)
7449 - Must happen before
7458 3. buffer_wbinvl1_vol
7460 - Must happen before
7470 atomicrmw acquire - system - global 1. buffer/global_atomic
7471 2. s_waitcnt vmcnt(0)
7473 - Must happen before
7474 following buffer_invl2 and
7485 - Must happen before
7493 stale L1 global data,
7494 nor see stale L2 MTYPE
7496 MTYPE RW and CC memory will
7497 never be stale in L2 due to
7500 atomicrmw acquire - agent - generic 1. flat_atomic
7501 2. s_waitcnt vmcnt(0) &
7504 - If TgSplit execution mode,
7508 - Must happen before
7517 3. buffer_wbinvl1_vol
7519 - Must happen before
7529 atomicrmw acquire - system - generic 1. flat_atomic
7530 2. s_waitcnt vmcnt(0) &
7533 - If TgSplit execution mode,
7537 - Must happen before
7550 - Must happen before
7558 stale L1 global data,
7559 nor see stale L2 MTYPE
7561 MTYPE RW and CC memory will
7562 never be stale in L2 due to
7565 fence acquire - singlethread *none* *none*
7567 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7569 - Use lgkmcnt(0) if not
7570 TgSplit execution mode
7571 and vmcnt(0) if TgSplit
7581 - However, since LLVM
7596 - s_waitcnt vmcnt(0)
7608 fence-paired-atomic).
7609 - s_waitcnt lgkmcnt(0)
7620 fence-paired-atomic).
7621 - Must happen before
7623 buffer_wbinvl1_vol and
7634 fence-paired-atomic.
7636 2. buffer_wbinvl1_vol
7638 - If not TgSplit execution
7645 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
7648 - If TgSplit execution mode,
7654 - However, since LLVM
7662 - Could be split into
7671 - s_waitcnt vmcnt(0)
7682 fence-paired-atomic).
7683 - s_waitcnt lgkmcnt(0)
7694 fence-paired-atomic).
7695 - Must happen before
7709 fence-paired-atomic.
7711 2. buffer_wbinvl1_vol
7713 - Must happen before any
7714 following global/generic
7723 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
7726 - If TgSplit execution mode,
7732 - However, since LLVM
7740 - Could be split into
7749 - s_waitcnt vmcnt(0)
7760 fence-paired-atomic).
7761 - s_waitcnt lgkmcnt(0)
7772 fence-paired-atomic).
7773 - Must happen before
7774 the following buffer_invl2 and
7787 fence-paired-atomic.
7792 - Must happen before any
7793 following global/generic
7800 stale L1 global data,
7801 nor see stale L2 MTYPE
7803 MTYPE RW and CC memory will
7804 never be stale in L2 due to
7807 ------------------------------------------------------------------------------------
7808 store atomic release - singlethread - global 1. buffer/global/flat_store
7809 - wavefront - generic
7810 store atomic release - singlethread - local *If TgSplit execution mode,
7811 - wavefront local address space cannot
7815 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7817 - Use lgkmcnt(0) if not
7818 TgSplit execution mode
7819 and vmcnt(0) if TgSplit
7821 - If OpenCL, omit lgkmcnt(0).
7822 - s_waitcnt vmcnt(0)
7825 global/generic load/store/
7826 load atomic/store atomic/
7828 - s_waitcnt lgkmcnt(0)
7835 - Must happen before
7846 2. buffer/global/flat_store
7847 store atomic release - workgroup - local *If TgSplit execution mode,
7848 local address space cannot
7852 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
7855 - If TgSplit execution mode,
7861 - Could be split into
7870 - s_waitcnt vmcnt(0)
7877 - s_waitcnt lgkmcnt(0)
7884 - Must happen before
7895 2. buffer/global/flat_store
7896 store atomic release - system - global 1. buffer_wbl2
7898 - Must happen before
7899 following s_waitcnt.
7900 - Performs L2 writeback to
7904 visible at system scope.
7906 2. s_waitcnt lgkmcnt(0) &
7909 - If TgSplit execution mode,
7915 - Could be split into
7924 - s_waitcnt vmcnt(0)
7925 must happen after any
7931 - s_waitcnt lgkmcnt(0)
7932 must happen after any
7938 - Must happen before
7943 to memory and the L2
7950 3. buffer/global/flat_store
7951 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
7952 - wavefront - generic
7953 atomicrmw release - singlethread - local *If TgSplit execution mode,
7954 - wavefront local address space cannot
7958 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7960 - Use lgkmcnt(0) if not
7961 TgSplit execution mode
7962 and vmcnt(0) if TgSplit
7966 - s_waitcnt vmcnt(0)
7969 global/generic load/store/
7970 load atomic/store atomic/
7972 - s_waitcnt lgkmcnt(0)
7979 - Must happen before
7990 2. buffer/global/flat_atomic
7991 atomicrmw release - workgroup - local *If TgSplit execution mode,
7992 local address space cannot
7996 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
7999 - If TgSplit execution mode,
8003 - Could be split into
8012 - s_waitcnt vmcnt(0)
8019 - s_waitcnt lgkmcnt(0)
8026 - Must happen before
8037 2. buffer/global/flat_atomic
8038 atomicrmw release - system - global 1. buffer_wbl2
8040 - Must happen before
8041 following s_waitcnt.
8042 - Performs L2 writeback to
8046 visible at system scope.
8048 2. s_waitcnt lgkmcnt(0) &
8051 - If TgSplit execution mode,
8055 - Could be split into
8064 - s_waitcnt vmcnt(0)
8071 - s_waitcnt lgkmcnt(0)
8078 - Must happen before
8083 to memory and the L2
8090 3. buffer/global/flat_atomic
8091 fence release - singlethread *none* *none*
8093 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8095 - Use lgkmcnt(0) if not
8096 TgSplit execution mode
8097 and vmcnt(0) if TgSplit
8107 - However, since LLVM
8122 - s_waitcnt vmcnt(0)
8127 load atomic/store atomic/
8129 - s_waitcnt lgkmcnt(0)
8136 - Must happen before
8145 fence-paired-atomic).
8152 fence-paired-atomic.
8154 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
8157 - If TgSplit execution mode,
8167 - However, since LLVM
8182 - Could be split into
8191 - s_waitcnt vmcnt(0)
8198 - s_waitcnt lgkmcnt(0)
8205 - Must happen before
8214 fence-paired-atomic).
8221 fence-paired-atomic.
8223 fence release - system *none* 1. buffer_wbl2
8228 - Must happen before
8229 following s_waitcnt.
8230 - Performs L2 writeback to
8234 visible at system scope.
8236 2. s_waitcnt lgkmcnt(0) &
8239 - If TgSplit execution mode,
8249 - However, since LLVM
8264 - Could be split into
8273 - s_waitcnt vmcnt(0)
8280 - s_waitcnt lgkmcnt(0)
8287 - Must happen before
8296 fence-paired-atomic).
8303 fence-paired-atomic.
8305 **Acquire-Release Atomic**
8306 ------------------------------------------------------------------------------------
8307 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
8308 - wavefront - generic
8309 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
8310 - wavefront local address space cannot
8314 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8316 - Use lgkmcnt(0) if not
8317 TgSplit execution mode
8318 and vmcnt(0) if TgSplit
8328 - s_waitcnt vmcnt(0)
8331 global/generic load/store/
8332 load atomic/store atomic/
8334 - s_waitcnt lgkmcnt(0)
8341 - Must happen before
8352 2. buffer/global_atomic
8353 3. s_waitcnt vmcnt(0)
8355 - If not TgSplit execution
8357 - Must happen before
8367 4. buffer_wbinvl1_vol
8369 - If not TgSplit execution
8376 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
8377 local address space cannot
8381 2. s_waitcnt lgkmcnt(0)
8384 - Must happen before
8393 older than the local load
8397 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
8399 - Use lgkmcnt(0) if not
8400 TgSplit execution mode
8401 and vmcnt(0) if TgSplit
8405 - s_waitcnt vmcnt(0)
8408 global/generic load/store/
8409 load atomic/store atomic/
8411 - s_waitcnt lgkmcnt(0)
8418 - Must happen before
8430 3. s_waitcnt lgkmcnt(0) &
8433 - If not TgSplit execution
8434 mode, omit vmcnt(0).
8437 - Must happen before
8439 buffer_wbinvl1_vol and
8448 older than a local load
8452 3. buffer_wbinvl1_vol
8454 - If not TgSplit execution
8461 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
8464 - If TgSplit execution mode,
8468 - Could be split into
8477 - s_waitcnt vmcnt(0)
8484 - s_waitcnt lgkmcnt(0)
8491 - Must happen before
8502 2. buffer/global_atomic
8503 3. s_waitcnt vmcnt(0)
8505 - Must happen before
8514 4. buffer_wbinvl1_vol
8516 - Must happen before
8526 atomicrmw acq_rel - system - global 1. buffer_wbl2
8528 - Must happen before
8529 following s_waitcnt.
8530 - Performs L2 writeback to
8534 visible at system scope.
8536 2. s_waitcnt lgkmcnt(0) &
8539 - If TgSplit execution mode,
8543 - Could be split into
8552 - s_waitcnt vmcnt(0)
8559 - s_waitcnt lgkmcnt(0)
8566 - Must happen before
8571 to global and L2 writeback
8572 have completed before
8577 3. buffer/global_atomic
8578 4. s_waitcnt vmcnt(0)
8580 - Must happen before
8581 following buffer_invl2 and
8592 - Must happen before
8600 stale L1 global data,
8601 nor see stale L2 MTYPE
8603 MTYPE RW and CC memory will
8604 never be stale in L2 due to
8607 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
8610 - If TgSplit execution mode,
8614 - Could be split into
8623 - s_waitcnt vmcnt(0)
8630 - s_waitcnt lgkmcnt(0)
8637 - Must happen before
8649 3. s_waitcnt vmcnt(0) &
8652 - If TgSplit execution mode,
8656 - Must happen before
8665 4. buffer_wbinvl1_vol
8667 - Must happen before
8677 atomicrmw acq_rel - system - generic 1. buffer_wbl2
8679 - Must happen before
8680 following s_waitcnt.
8681 - Performs L2 writeback to
8685 visible at system scope.
8687 2. s_waitcnt lgkmcnt(0) &
8690 - If TgSplit execution mode,
8694 - Could be split into
8703 - s_waitcnt vmcnt(0)
8710 - s_waitcnt lgkmcnt(0)
8717 - Must happen before
8722 to global and L2 writeback
8723 have completed before
8729 4. s_waitcnt vmcnt(0) &
8732 - If TgSplit execution mode,
8736 - Must happen before
8737 following buffer_invl2 and
8748 - Must happen before
8756 stale L1 global data,
8757 nor see stale L2 MTYPE
8759 MTYPE RW and CC memory will
8760 never be stale in L2 due to
8763 fence acq_rel - singlethread *none* *none*
8765 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8767 - Use lgkmcnt(0) if not
8768 TgSplit execution mode
8769 and vmcnt(0) if TgSplit
8788 - s_waitcnt vmcnt(0)
8793 load atomic/store atomic/
8795 - s_waitcnt lgkmcnt(0)
8802 - Must happen before
8825 acquire-fence-paired-atomic)
8846 release-fence-paired-atomic).
8850 - Must happen before
8854 acquire-fence-paired
8855 atomic has completed
8864 acquire-fence-paired-atomic.
8866 2. buffer_wbinvl1_vol
8868 - If not TgSplit execution
8875 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
8878 - If TgSplit execution mode,
8884 - However, since LLVM
8892 - Could be split into
8901 - s_waitcnt vmcnt(0)
8908 - s_waitcnt lgkmcnt(0)
8915 - Must happen before
8920 global/local/generic
8929 acquire-fence-paired-atomic)
8941 global/local/generic
8950 release-fence-paired-atomic).
8955 2. buffer_wbinvl1_vol
8957 - Must happen before
8971 fence acq_rel - system *none* 1. buffer_wbl2
8976 - Must happen before
8977 following s_waitcnt.
8978 - Performs L2 writeback to
8982 visible at system scope.
8984 2. s_waitcnt lgkmcnt(0) &
8987 - If TgSplit execution mode,
8993 - However, since LLVM
9001 - Could be split into
9010 - s_waitcnt vmcnt(0)
9017 - s_waitcnt lgkmcnt(0)
9024 - Must happen before
9025 the following buffer_invl2 and
9029 global/local/generic
9038 acquire-fence-paired-atomic)
9050 global/local/generic
9059 release-fence-paired-atomic).
9067 - Must happen before
9076 stale L1 global data,
9077 nor see stale L2 MTYPE
9079 MTYPE RW and CC memory will
9080 never be stale in L2 due to
9083 **Sequential Consistent Atomic**
9084 ------------------------------------------------------------------------------------
9085 load atomic seq_cst - singlethread - global *Same as corresponding
9086 - wavefront - local load atomic acquire,
9087 - generic except must generate
9088 all instructions even
9090 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
9092 - Use lgkmcnt(0) if not
9093 TgSplit execution mode
9094 and vmcnt(0) if TgSplit
9096 - s_waitcnt lgkmcnt(0) must
9109 lgkmcnt(0) and so do
9112 - s_waitcnt vmcnt(0)
9131 consistent global/local
9157 order. The s_waitcnt
9158 could be placed after
9162 make the s_waitcnt be
9169 instructions same as
9172 except must generate
9173 all instructions even
9175 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
9176 local address space cannot
9179 *Same as corresponding
9180 load atomic acquire,
9181 except must generate
9182 all instructions even
9185 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
9186 - system - generic vmcnt(0)
9188 - If TgSplit execution mode,
9190 - Could be split into
9199 - s_waitcnt lgkmcnt(0)
9212 lgkmcnt(0) and so do
9215 - s_waitcnt vmcnt(0)
9260 order. The s_waitcnt
9261 could be placed after
9265 make the s_waitcnt be
9272 instructions same as
9275 except must generate
9276 all instructions even
9278 store atomic seq_cst - singlethread - global *Same as corresponding
9279 - wavefront - local store atomic release,
9280 - workgroup - generic except must generate
9281 - agent all instructions even
9282 - system for OpenCL.*
9283 atomicrmw seq_cst - singlethread - global *Same as corresponding
9284 - wavefront - local atomicrmw acq_rel,
9285 - workgroup - generic except must generate
9286 - agent all instructions even
9287 - system for OpenCL.*
9288 fence seq_cst - singlethread *none* *Same as corresponding
9289 - wavefront fence acq_rel,
9290 - workgroup except must generate
9291 - agent all instructions even
9292 - system for OpenCL.*
9293 ============ ============ ============== ========== ================================
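As a worked example of one row of the table above, the following LLVM IR sketch (hypothetical names) performs an agent scope acquire load from global memory. Per the GFX90A row for ``load atomic acquire - agent - global``, it is expected to be lowered to a ``buffer/global_load`` with ``glc=1``, followed by ``s_waitcnt vmcnt(0)`` and ``buffer_wbinvl1_vol``.

.. code-block:: llvm

  define amdgpu_kernel void @acquire_flag(ptr addrspace(1) %flag,
                                          ptr addrspace(1) %out) {
  entry:
    ; Expected GFX90A lowering per the table above:
    ;   1. global_load ... glc=1
    ;   2. s_waitcnt vmcnt(0)
    ;   3. buffer_wbinvl1_vol
    %v = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
    store i32 %v, ptr addrspace(1) %out, align 4
    ret void
  }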
9295 .. _amdgpu-amdhsa-memory-model-gfx942:
Memory Model GFX940, GFX941, GFX942
+++++++++++++++++++++++++++++++++++
For GFX940, GFX941, and GFX942:
9302 * Each agent has multiple shader arrays (SA).
9303 * Each SA has multiple compute units (CU).
9304 * Each CU has multiple SIMDs that execute wavefronts.
9305 * The wavefronts for a single work-group are executed in the same CU but may be
9306 executed by different SIMDs. The exception is when in tgsplit execution mode
9307 when the wavefronts may be executed by different SIMDs in different CUs.
9308 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
9309 executing on it. The exception is when in tgsplit execution mode when no LDS
9310 is allocated as wavefronts of the same work-group can be in different CUs.
9311 * All LDS operations of a CU are performed as wavefront wide operations in a
9312 global order and involve no caching. Completion is reported to a wavefront in execution order.
9314 * The LDS memory has multiple request queues shared by the SIMDs of a
9315 CU. Therefore, the LDS operations performed by different wavefronts of a
9316 work-group can be reordered relative to each other, which can result in
9317 reordering the visibility of vector memory operations with respect to LDS
9318 operations of other wavefronts in the same work-group. A ``s_waitcnt
9319 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
9320 vector memory operations between wavefronts of a work-group, but not between
9321 operations performed by the same wavefront.
9322 * The vector memory operations are performed as wavefront wide operations and
9323 completion is reported to a wavefront in execution order. The exception is
9324 that ``flat_load/store/atomic`` instructions can report out of vector memory
9325 order if they access LDS memory, and out of LDS operation order if they access global memory.
9327 * The vector memory operations access a single vector L1 cache shared by all
9328 SIMDs of a CU. Therefore:
9330 * No special action is required for coherence between the lanes of a single wavefront.
9333 * No special action is required for coherence between wavefronts in the same
9334 work-group since they execute on the same CU. The exception is when in
9335 tgsplit execution mode as wavefronts of the same work-group can be in
9336 different CUs and so a ``buffer_inv sc0`` is required which will invalidate the L1 cache.
9339 * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
9340 between wavefronts executing in different work-groups as they may be
9341 executing on different CUs.
9343 * Atomic read-modify-write instructions implicitly bypass the L1 cache.
9344 Therefore, they do not use the sc0 bit for coherence and instead use it to
9345 indicate if the instruction returns the original value being updated. They
9346 do use sc1 to indicate system or agent scope coherence.
9348 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
9349 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
9350 scalar operations are used in a restricted way so do not impact the memory
9351 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
9352 * The vector and scalar memory operations use an L2 cache.
9354 * The gfx942 can be configured as a number of smaller agents with each having
9355 a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
9356 larger agents with groups of CUs on each agent each sharing separate L2 caches.
9358 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
9360 * Each CU has a separate request queue per channel for its associated L2.
9361 Therefore, the vector and scalar memory operations performed by wavefronts
9362 executing with different L1 caches and the same L2 cache can be reordered
9363 relative to each other.
9364 * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
9365 vector memory operations of different CUs. It ensures a previous vector
9366 memory operation has completed before executing a subsequent vector memory
9367 or LDS operation and so can be used to meet the requirements of acquire and release.
9369 * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
9370 (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
9371 the PTE C-bit set for memory not local to the L2.
9373 * Any local memory cache lines will be automatically invalidated by writes
9374 from CUs associated with other L2 caches, or writes from the CPU, due to
9375 the cache probe caused by the PTE C-bit.
9376 * XGMI accesses from the CPU to local memory may be cached on the CPU.
9377 Subsequent access from the GPU will automatically invalidate or writeback
9378 the CPU cache due to the L2 probe filter.
9379 * To ensure coherence of local memory writes of CUs with different L1 caches
9380 in the same agent a ``buffer_wbl2`` is required. It does nothing if the
9381 agent is configured to have a single L2, or will writeback dirty L2 cache
9382 lines if configured to have multiple L2 caches.
9383 * To ensure coherence of local memory writes of CUs in different agents a
9384 ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
9385 * To ensure coherence of local memory reads of CUs with different L1 caches
9386 in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
9387 agent is configured to have a single L2, or will invalidate non-local L2
9388 cache lines if configured to have multiple L2 caches.
9389 * To ensure coherence of local memory reads of CUs in different agents a
9390 ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
9391 lines if configured to have multiple L2 caches.
9393 * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
9394 UC (uncached) which bypasses the L2.
9396 Scalar memory operations are only used to access memory that is proven to not
9397 change during the execution of the kernel dispatch. This includes constant
9398 address space and global address space for program scope ``const`` variables.
9399 Therefore, the kernel machine code does not have to maintain the scalar cache to
9400 ensure it is coherent with the vector caches. The scalar and vector caches are
9401 invalidated between kernel dispatches by CP since constant address space data
9402 may change between kernel dispatch executions. See
9403 :ref:`amdgpu-amdhsa-memory-spaces`.
9405 The one exception is if scalar writes are used to spill SGPR registers. In this
9406 case the AMDGPU backend ensures the memory location used to spill is never
9407 accessed by vector memory operations at the same time. If scalar writes are used
9408 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
9409 return since the locations may be used for vector memory instructions by a
9410 future wavefront that uses the same scratch area, or a function call that
9411 creates a frame at the same address, respectively. There is no need for a
9412 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
9414 For kernarg backing memory:
9416 * CP invalidates the L1 cache at the start of each kernel dispatch.
9417 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
9418 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
9419 cache. This also causes it to be treated as non-volatile and so is not
9420 invalidated by ``*_vol``.
9421 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
9422 so the L2 cache will be coherent with the CPU and other agents.
9424 Scratch backing memory (which is used for the private address space) is accessed
9425 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
9426 only accessed by a single thread, and is always write-before-read, there is
9427 never a need to invalidate these entries from the L1 cache. Hence all cache
9428 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
9430 The code sequences used to implement the memory model for GFX940, GFX941, GFX942
9431 are defined in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table`.
9433 .. table:: AMDHSA Memory Model Code Sequences GFX940, GFX941, GFX942
9434 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-gfx941-gfx942-table
9436 ============ ============ ============== ========== ================================
9437 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
9438 Ordering Sync Scope Address GFX940, GFX941, GFX942
9440 ============ ============ ============== ========== ================================
9442 ------------------------------------------------------------------------------------
9443 load *none* *none* - global - !volatile & !nontemporal
9445 - private 1. buffer/global/flat_load
9447 - !volatile & nontemporal
9449 1. buffer/global/flat_load
9454 1. buffer/global/flat_load
9456 2. s_waitcnt vmcnt(0)
9458 - Must happen before
9459 any following volatile
9470 load *none* *none* - local 1. ds_load
9471 store *none* *none* - global - !volatile & !nontemporal
9473 - private 1. GFX940, GFX941
9474 - constant buffer/global/flat_store
9477 buffer/global/flat_store
9479 - !volatile & nontemporal
9482 buffer/global/flat_store
9485 buffer/global/flat_store
9490 1. buffer/global/flat_store
9492 2. s_waitcnt vmcnt(0)
9494 - Must happen before
9495 any following volatile
9506 store *none* *none* - local 1. ds_store
9507 **Unordered Atomic**
9508 ------------------------------------------------------------------------------------
9509 load atomic unordered *any* *any* *Same as non-atomic*.
9510 store atomic unordered *any* *any* *Same as non-atomic*.
9511 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
9512 **Monotonic Atomic**
9513 ------------------------------------------------------------------------------------
9514 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
9515 - wavefront - generic
9516 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
9518 load atomic monotonic - singlethread - local *If TgSplit execution mode,
9519 - wavefront local address space cannot
9520 - workgroup be used.*
9523 load atomic monotonic - agent - global 1. buffer/global/flat_load
9525 load atomic monotonic - system - global 1. buffer/global/flat_load
9526 - generic sc0=1 sc1=1
9527 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
9528 - wavefront - generic
9529 store atomic monotonic - workgroup - global 1. buffer/global/flat_store
9531 store atomic monotonic - agent - global 1. buffer/global/flat_store
9533 store atomic monotonic - system - global 1. buffer/global/flat_store
9534 - generic sc0=1 sc1=1
9535 store atomic monotonic - singlethread - local *If TgSplit execution mode,
9536 - wavefront local address space cannot
9537 - workgroup be used.*
9540 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
9541 - wavefront - generic
9544 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
9546 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
9547 - wavefront local address space cannot
9548 - workgroup be used.*
9552 ------------------------------------------------------------------------------------
9553 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
9556 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1
9557 2. s_waitcnt vmcnt(0)
9559 - If not TgSplit execution
9561 - Must happen before the
9562 following buffer_inv.
9566 - If not TgSplit execution
9568 - Must happen before
9579 load atomic acquire - workgroup - local *If TgSplit execution mode,
9580 local address space cannot
9584 2. s_waitcnt lgkmcnt(0)
9587 - Must happen before
9596 older than the local load
9600 load atomic acquire - workgroup - generic 1. flat_load sc0=1
9601 2. s_waitcnt lgkm/vmcnt(0)
9603 - Use lgkmcnt(0) if not
9604 TgSplit execution mode
9605 and vmcnt(0) if TgSplit
9607 - If OpenCL, omit lgkmcnt(0).
9608 - Must happen before
9611 following global/generic
9618 older than a local load
9624 - If not TgSplit execution
9631 load atomic acquire - agent - global 1. buffer/global_load
9633 2. s_waitcnt vmcnt(0)
9635 - Must happen before
9645 - Must happen before
9655 load atomic acquire - system - global 1. buffer/global/flat_load
9657 2. s_waitcnt vmcnt(0)
9659 - Must happen before
9667 3. buffer_inv sc0=1 sc1=1
9669 - Must happen before
9677 stale MTYPE NC global data.
9678 MTYPE RW and CC memory will
9679 never be stale due to the
9682 load atomic acquire - agent - generic 1. flat_load sc1=1
9683 2. s_waitcnt vmcnt(0) &
9686 - If TgSplit execution mode,
9690 - Must happen before
9693 - Ensures the flat_load
9700 - Must happen before
9710 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1
9711 2. s_waitcnt vmcnt(0) &
9714 - If TgSplit execution mode,
9718 - Must happen before
9721 - Ensures the flat_load
9726 3. buffer_inv sc0=1 sc1=1
9728 - Must happen before
9736 stale MTYPE NC global data.
9737 MTYPE RW and CC memory will
9738 never be stale due to the
9741 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
9742 - wavefront - generic
9743 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
9744 - wavefront local address space cannot
9748 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
9749 2. s_waitcnt vmcnt(0)
9751 - If not TgSplit execution
9753 - Must happen before the
9754 following buffer_inv.
9755 - Ensures the atomicrmw
9762 - If not TgSplit execution
9764 - Must happen before
9774 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
9775 local address space cannot
9779 2. s_waitcnt lgkmcnt(0)
9782 - Must happen before
9791 older than the local
9795 atomicrmw acquire - workgroup - generic 1. flat_atomic
9796 2. s_waitcnt lgkm/vmcnt(0)
9798 - Use lgkmcnt(0) if not
9799 TgSplit execution mode
9800 and vmcnt(0) if TgSplit
9802 - If OpenCL, omit lgkmcnt(0).
9803 - Must happen before
9820 - If not TgSplit execution
9827 atomicrmw acquire - agent - global 1. buffer/global_atomic
9828 2. s_waitcnt vmcnt(0)
9830 - Must happen before
9841 - Must happen before
9851 atomicrmw acquire - system - global 1. buffer/global_atomic
9853 2. s_waitcnt vmcnt(0)
9855 - Must happen before
9864 3. buffer_inv sc0=1 sc1=1
9866 - Must happen before
9874 stale MTYPE NC global data.
9875 MTYPE RW and CC memory will
9876 never be stale due to the
9879 atomicrmw acquire - agent - generic 1. flat_atomic
9880 2. s_waitcnt vmcnt(0) &
9883 - If TgSplit execution mode,
9887 - Must happen before
9898 - Must happen before
9908 atomicrmw acquire - system - generic 1. flat_atomic sc1=1
9909 2. s_waitcnt vmcnt(0) &
9912 - If TgSplit execution mode,
9916 - Must happen before
9925 3. buffer_inv sc0=1 sc1=1
9927 - Must happen before
9935 stale MTYPE NC global data.
9936 MTYPE RW and CC memory will
9937 never be stale due to the
9940 fence acquire - singlethread *none* *none*
9942 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
9944 - Use lgkmcnt(0) if not
9945 TgSplit execution mode
9946 and vmcnt(0) if TgSplit
9956 - However, since LLVM
9971 - s_waitcnt vmcnt(0)
9983 fence-paired-atomic).
9984 - s_waitcnt lgkmcnt(0)
9995 fence-paired-atomic).
9996 - Must happen before
10009 fence-paired-atomic.
10011 3. buffer_inv sc0=1
10013 - If not TgSplit execution
10020 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
10023 - If TgSplit execution mode,
10029 - However, since LLVM
10037 - Could be split into
10041 lgkmcnt(0) to allow
10043 independently moved
10046 - s_waitcnt vmcnt(0)
10049 global/generic load
10053 and memory ordering
10057 fence-paired-atomic).
10058 - s_waitcnt lgkmcnt(0)
10065 and memory ordering
10069 fence-paired-atomic).
10070 - Must happen before
10074 fence-paired atomic
10076 before invalidating
10080 locations read must
10084 fence-paired-atomic.
10086 2. buffer_inv sc1=1
10088 - Must happen before any
10089 following global/generic
10098 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
10101 - If TgSplit execution mode,
10107 - However, since LLVM
10115 - Could be split into
10119 lgkmcnt(0) to allow
10121 independently moved
10124 - s_waitcnt vmcnt(0)
10127 global/generic load
10131 and memory ordering
10135 fence-paired-atomic).
10136 - s_waitcnt lgkmcnt(0)
10143 and memory ordering
10147 fence-paired-atomic).
10148 - Must happen before
10152 fence-paired atomic
10154 before invalidating
10158 locations read must
10162 fence-paired-atomic.
10164 2. buffer_inv sc0=1 sc1=1
10166 - Must happen before any
10167 following global/generic
10177 ------------------------------------------------------------------------------------
10178 store atomic release - singlethread - global 1. GFX940, GFX941
10179 - wavefront - generic buffer/global/flat_store
10182 buffer/global/flat_store
10184 store atomic release - singlethread - local *If TgSplit execution mode,
10185 - wavefront local address space cannot
10189 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10191 - Use lgkmcnt(0) if not
10192 TgSplit execution mode
10193 and vmcnt(0) if TgSplit
10195 - If OpenCL, omit lgkmcnt(0).
10196 - s_waitcnt vmcnt(0)
10199 global/generic load/store/
10200 load atomic/store atomic/
10202 - s_waitcnt lgkmcnt(0)
10209 - Must happen before
10217 store that is being
10221 buffer/global/flat_store
10224 buffer/global/flat_store
10226 store atomic release - workgroup - local *If TgSplit execution mode,
10227 local address space cannot
10231 store atomic release - agent - global 1. buffer_wbl2 sc1=1
10233 - Must happen before
10234 following s_waitcnt.
10235 - Performs L2 writeback to
10238 store/atomicrmw are
10239 visible at agent scope.
10241 2. s_waitcnt lgkmcnt(0) &
10244 - If TgSplit execution mode,
10250 - Could be split into
10254 lgkmcnt(0) to allow
10256 independently moved
10259 - s_waitcnt vmcnt(0)
10266 - s_waitcnt lgkmcnt(0)
10273 - Must happen before
10281 store that is being
10285 buffer/global/flat_store
10288 buffer/global/flat_store
10290 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10292 - Must happen before
10293 following s_waitcnt.
10294 - Performs L2 writeback to
10297 store/atomicrmw are
10298 visible at system scope.
10300 2. s_waitcnt lgkmcnt(0) &
10303 - If TgSplit execution mode,
10309 - Could be split into
10313 lgkmcnt(0) to allow
10315 independently moved
10318 - s_waitcnt vmcnt(0)
10319 must happen after any
10325 - s_waitcnt lgkmcnt(0)
10326 must happen after any
10332 - Must happen before
10337 to memory and the L2
10341 store that is being
10344 3. buffer/global/flat_store
10346 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
10347 - wavefront - generic
10348 atomicrmw release - singlethread - local *If TgSplit execution mode,
10349 - wavefront local address space cannot
10353 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10355 - Use lgkmcnt(0) if not
10356 TgSplit execution mode
10357 and vmcnt(0) if TgSplit
10361 - s_waitcnt vmcnt(0)
10364 global/generic load/store/
10365 load atomic/store atomic/
10367 - s_waitcnt lgkmcnt(0)
10374 - Must happen before
10385 2. buffer/global/flat_atomic sc0=1
10386 atomicrmw release - workgroup - local *If TgSplit execution mode,
10387 local address space cannot
10391 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1
10393 - Must happen before
10394 following s_waitcnt.
10395 - Performs L2 writeback to
10398 store/atomicrmw are
10399 visible at agent scope.
10401 2. s_waitcnt lgkmcnt(0) &
10404 - If TgSplit execution mode,
10408 - Could be split into
10412 lgkmcnt(0) to allow
10414 independently moved
10417 - s_waitcnt vmcnt(0)
10424 - s_waitcnt lgkmcnt(0)
10431 - Must happen before
10436 to global and local
10442 3. buffer/global/flat_atomic sc1=1
10443 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10445 - Must happen before
10446 following s_waitcnt.
10447 - Performs L2 writeback to
10450 store/atomicrmw are
10451 visible at system scope.
10453 2. s_waitcnt lgkmcnt(0) &
10456 - If TgSplit execution mode,
10460 - Could be split into
10464 lgkmcnt(0) to allow
10466 independently moved
10469 - s_waitcnt vmcnt(0)
10476 - s_waitcnt lgkmcnt(0)
10483 - Must happen before
10488 to memory and the L2
10492 store that is being
10495 3. buffer/global/flat_atomic
10497 fence release - singlethread *none* *none*
10499 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10501 - Use lgkmcnt(0) if not
10502 TgSplit execution mode
10503 and vmcnt(0) if TgSplit
10513 - However, since LLVM
10518 always generate. If
10528 - s_waitcnt vmcnt(0)
10533 load atomic/store atomic/
10535 - s_waitcnt lgkmcnt(0)
10542 - Must happen before
10543 any following store
10547 and memory ordering
10551 fence-paired-atomic).
10558 fence-paired-atomic.
10560 fence release - agent *none* 1. buffer_wbl2 sc1=1
10565 - Must happen before
10566 following s_waitcnt.
10567 - Performs L2 writeback to
10570 store/atomicrmw are
10571 visible at agent scope.
10573 2. s_waitcnt lgkmcnt(0) &
10576 - If TgSplit execution mode,
10586 - However, since LLVM
10591 always generate. If
10601 - Could be split into
10605 lgkmcnt(0) to allow
10607 independently moved
10610 - s_waitcnt vmcnt(0)
10617 - s_waitcnt lgkmcnt(0)
10624 - Must happen before
10625 any following store
10629 and memory ordering
10633 fence-paired-atomic).
10640 fence-paired-atomic.
10642 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10644 - Must happen before
10645 following s_waitcnt.
10646 - Performs L2 writeback to
10649 store/atomicrmw are
10650 visible at system scope.
10652 2. s_waitcnt lgkmcnt(0) &
10655 - If TgSplit execution mode,
10665 - However, since LLVM
10670 always generate. If
10680 - Could be split into
10684 lgkmcnt(0) to allow
10686 independently moved
10689 - s_waitcnt vmcnt(0)
10696 - s_waitcnt lgkmcnt(0)
10703 - Must happen before
10704 any following store
10708 and memory ordering
10712 fence-paired-atomic).
10719 fence-paired-atomic.
10721 **Acquire-Release Atomic**
10722 ------------------------------------------------------------------------------------
10723 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
10724 - wavefront - generic
10725 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
10726 - wavefront local address space cannot
10730 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10732 - Use lgkmcnt(0) if not
10733 TgSplit execution mode
10734 and vmcnt(0) if TgSplit
10738 - Must happen after
10744 - s_waitcnt vmcnt(0)
10747 global/generic load/store/
10748 load atomic/store atomic/
10750 - s_waitcnt lgkmcnt(0)
10757 - Must happen before
10768 2. buffer/global_atomic
10769 3. s_waitcnt vmcnt(0)
10771 - If not TgSplit execution
10773 - Must happen before
10783 4. buffer_inv sc0=1
10785 - If not TgSplit execution
10792 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
10793 local address space cannot
10797 2. s_waitcnt lgkmcnt(0)
10800 - Must happen before
10809 older than the local load
10813 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
10815 - Use lgkmcnt(0) if not
10816 TgSplit execution mode
10817 and vmcnt(0) if TgSplit
10821 - s_waitcnt vmcnt(0)
10824 global/generic load/store/
10825 load atomic/store atomic/
10827 - s_waitcnt lgkmcnt(0)
10834 - Must happen before
10846 3. s_waitcnt lgkmcnt(0) &
10849 - If not TgSplit execution
10850 mode, omit vmcnt(0).
10853 - Must happen before
10864 older than a local load
10868 3. buffer_inv sc0=1
10870 - If not TgSplit execution
10877 atomicrmw acq_rel - agent - global 1. buffer_wbl2 sc1=1
10879 - Must happen before
10880 following s_waitcnt.
10881 - Performs L2 writeback to
10884 store/atomicrmw are
10885 visible at agent scope.
10887 2. s_waitcnt lgkmcnt(0) &
10890 - If TgSplit execution mode,
10894 - Could be split into
10898 lgkmcnt(0) to allow
10900 independently moved
10903 - s_waitcnt vmcnt(0)
10910 - s_waitcnt lgkmcnt(0)
10917 - Must happen before
10928 3. buffer/global_atomic
10929 4. s_waitcnt vmcnt(0)
10931 - Must happen before
10940 5. buffer_inv sc1=1
10942 - Must happen before
10952 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1
10954 - Must happen before
10955 following s_waitcnt.
10956 - Performs L2 writeback to
10959 store/atomicrmw are
10960 visible at system scope.
10962 2. s_waitcnt lgkmcnt(0) &
10965 - If TgSplit execution mode,
10969 - Could be split into
10973 lgkmcnt(0) to allow
10975 independently moved
10978 - s_waitcnt vmcnt(0)
10985 - s_waitcnt lgkmcnt(0)
10992 - Must happen before
10997 to global and L2 writeback
10998 have completed before
11003 3. buffer/global_atomic
11005 4. s_waitcnt vmcnt(0)
11007 - Must happen before
11016 5. buffer_inv sc0=1 sc1=1
11018 - Must happen before
11026 MTYPE NC global data.
11027 MTYPE RW and CC memory will
11028 never be stale due to the
11031 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1
11033 - Must happen before
11034 following s_waitcnt.
11035 - Performs L2 writeback to
11038 store/atomicrmw are
11039 visible at agent scope.
11041 2. s_waitcnt lgkmcnt(0) &
11044 - If TgSplit execution mode,
11048 - Could be split into
11052 lgkmcnt(0) to allow
11054 independently moved
11057 - s_waitcnt vmcnt(0)
11064 - s_waitcnt lgkmcnt(0)
11071 - Must happen before
11083 4. s_waitcnt vmcnt(0) &
11086 - If TgSplit execution mode,
11090 - Must happen before
11099 5. buffer_inv sc1=1
11101 - Must happen before
11111 atomicrmw acq_rel - system - generic 1. buffer_wbl2 sc0=1 sc1=1
11113 - Must happen before
11114 following s_waitcnt.
11115 - Performs L2 writeback to
11118 store/atomicrmw are
11119 visible at system scope.
11121 2. s_waitcnt lgkmcnt(0) &
11124 - If TgSplit execution mode,
11128 - Could be split into
11132 lgkmcnt(0) to allow
11134 independently moved
11137 - s_waitcnt vmcnt(0)
11144 - s_waitcnt lgkmcnt(0)
11151 - Must happen before
11156 to global and L2 writeback
11157 have completed before
11162 3. flat_atomic sc1=1
11163 4. s_waitcnt vmcnt(0) &
11166 - If TgSplit execution mode,
11170 - Must happen before
11179 5. buffer_inv sc0=1 sc1=1
11181 - Must happen before
11189 MTYPE NC global data.
11190 MTYPE RW and CC memory will
11191 never be stale due to the
11194 fence acq_rel - singlethread *none* *none*
11196 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
11198 - Use lgkmcnt(0) if not
11199 TgSplit execution mode
11200 and vmcnt(0) if TgSplit
11219 - s_waitcnt vmcnt(0)
11224 load atomic/store atomic/
11226 - s_waitcnt lgkmcnt(0)
11233 - Must happen before
11252 and memory ordering
11256 acquire-fence-paired-atomic)
11269 local/generic store
11273 and memory ordering
11277 release-fence-paired-atomic).
11281 - Must happen before
11285 acquire-fence-paired
11286 atomic has completed
11287 before invalidating
11291 locations read must
11295 acquire-fence-paired-atomic.
11297 3. buffer_inv sc0=1
11299 - If not TgSplit execution
11306 fence acq_rel - agent *none* 1. buffer_wbl2 sc1=1
11311 - Must happen before
11312 following s_waitcnt.
11313 - Performs L2 writeback to
11316 store/atomicrmw are
11317 visible at agent scope.
11319 2. s_waitcnt lgkmcnt(0) &
11322 - If TgSplit execution mode,
11328 - However, since LLVM
11336 - Could be split into
11340 lgkmcnt(0) to allow
11342 independently moved
11345 - s_waitcnt vmcnt(0)
11352 - s_waitcnt lgkmcnt(0)
11359 - Must happen before
11364 global/local/generic
11369 and memory ordering
11373 acquire-fence-paired-atomic)
11375 before invalidating
11385 global/local/generic
11390 and memory ordering
11394 release-fence-paired-atomic).
11399 3. buffer_inv sc1=1
11401 - Must happen before
11415 fence acq_rel - system *none* 1. buffer_wbl2 sc0=1 sc1=1
11420 - Must happen before
11421 following s_waitcnt.
11422 - Performs L2 writeback to
11425 store/atomicrmw are
11426 visible at system scope.
11428 1. s_waitcnt lgkmcnt(0) &
11431 - If TgSplit execution mode,
11437 - However, since LLVM
11445 - Could be split into
11449 lgkmcnt(0) to allow
11451 independently moved
11454 - s_waitcnt vmcnt(0)
11461 - s_waitcnt lgkmcnt(0)
11468 - Must happen before
11473 global/local/generic
11478 and memory ordering
11482 acquire-fence-paired-atomic)
11484 before invalidating
11494 global/local/generic
11499 and memory ordering
11503 release-fence-paired-atomic).
11508 2. buffer_inv sc0=1 sc1=1
11510 - Must happen before
11519 MTYPE NC global data.
11520 MTYPE RW and CC memory will
11521 never be stale due to the
11524 **Sequential Consistent Atomic**
11525 ------------------------------------------------------------------------------------
11526 load atomic seq_cst - singlethread - global *Same as corresponding
11527 - wavefront - local load atomic acquire,
11528 - generic except must generate
11529 all instructions even
11531 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
11533 - Use lgkmcnt(0) if not
11534 TgSplit execution mode
11535 and vmcnt(0) if TgSplit
11537 - s_waitcnt lgkmcnt(0) must
11544 ordering of seq_cst
11550 lgkmcnt(0) and so do
11553 - s_waitcnt vmcnt(0)
11556 global/generic load
11560 ordering of seq_cst
11572 consistent global/local
11573 memory instructions
11579 prevents reordering
11582 seq_cst load. (Note
11588 followed by a store
11595 release followed by
11598 order. The s_waitcnt
11599 could be placed after
11600 seq_store or before
11603 make the s_waitcnt be
11604 as late as possible
11610 instructions same as
11613 except must generate
11614 all instructions even
11616 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
11617 local address space cannot
11620 *Same as corresponding
11621 load atomic acquire,
11622 except must generate
11623 all instructions even
11626 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
11627 - system - generic vmcnt(0)
11629 - If TgSplit execution mode,
11631 - Could be split into
11635 lgkmcnt(0) to allow
11637 independently moved
11640 - s_waitcnt lgkmcnt(0)
11643 global/generic load
11647 ordering of seq_cst
11653 lgkmcnt(0) and so do
11656 - s_waitcnt vmcnt(0)
11659 global/generic load
11663 ordering of seq_cst
11676 memory instructions
11682 prevents reordering
11685 seq_cst load. (Note
11691 followed by a store
11698 release followed by
11701 order. The s_waitcnt
11702 could be placed after
11703 seq_store or before
11706 make the s_waitcnt be
11707 as late as possible
11713 instructions same as
11716 except must generate
11717 all instructions even
11719 store atomic seq_cst - singlethread - global *Same as corresponding
11720 - wavefront - local store atomic release,
11721 - workgroup - generic except must generate
11722 - agent all instructions even
11723 - system for OpenCL.*
11724 atomicrmw seq_cst - singlethread - global *Same as corresponding
11725 - wavefront - local atomicrmw acq_rel,
11726 - workgroup - generic except must generate
11727 - agent all instructions even
11728 - system for OpenCL.*
11729 fence seq_cst - singlethread *none* *Same as corresponding
11730 - wavefront fence acq_rel,
11731 - workgroup except must generate
11732 - agent all instructions even
11733 - system for OpenCL.*
11734 ============ ============ ============== ========== ================================
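As a worked example of the fence rows in the table above, the following LLVM IR sketch (hypothetical names) uses the usual fence-paired-atomic pattern at agent scope. Per the table, the release fence is implemented with ``buffer_wbl2 sc1=1`` followed by the required ``s_waitcnt``, and the acquire fence with the required ``s_waitcnt`` followed by ``buffer_inv sc1=1``; the exact waits depend on TgSplit mode and OpenCL as noted in the table.

.. code-block:: llvm

  define void @agent_handoff(ptr addrspace(1) %flag) {
  entry:
    ; "fence release - agent": buffer_wbl2 sc1=1, then s_waitcnt, so prior
    ; writes are visible at agent scope before the flag is published.
    fence syncscope("agent") release
    store atomic i32 1, ptr addrspace(1) %flag syncscope("agent") monotonic, align 4
    ; "fence acquire - agent": s_waitcnt, then buffer_inv sc1=1, so stale
    ; cache lines are not read after the fence.
    %seen = load atomic i32, ptr addrspace(1) %flag syncscope("agent") monotonic, align 4
    fence syncscope("agent") acquire
    ret void
  }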
11736 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11738 Memory Model GFX10-GFX11
11739 ++++++++++++++++++++++++
For GFX10-GFX11:
11743 * Each agent has multiple shader arrays (SA).
11744 * Each SA has multiple work-group processors (WGP).
11745 * Each WGP has multiple compute units (CU).
11746 * Each CU has multiple SIMDs that execute wavefronts.
11747 * The wavefronts for a single work-group are executed in the same
11748 WGP. In CU wavefront execution mode the wavefronts may be executed by
11749 different SIMDs in the same CU. In WGP wavefront execution mode the
11750 wavefronts may be executed by different SIMDs in different CUs in the same WGP.
11752 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups executing on it.
11754 * All LDS operations of a WGP are performed as wavefront wide operations in a
11755 global order and involve no caching. Completion is reported to a wavefront in execution order.
11757 * The LDS memory has multiple request queues shared by the SIMDs of a
11758 WGP. Therefore, the LDS operations performed by different wavefronts of a
11759 work-group can be reordered relative to each other, which can result in
11760 reordering the visibility of vector memory operations with respect to LDS
11761 operations of other wavefronts in the same work-group. A ``s_waitcnt
11762 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11763 vector memory operations between wavefronts of a work-group, but not between
11764 operations performed by the same wavefront.
11765 * The vector memory operations are performed as wavefront wide operations.
11766 Completion of load/store/sample operations is reported to a wavefront in
11767 execution order of other load/store/sample operations performed by that wavefront.
11769 * The vector memory operations access a vector L0 cache. There is a single L0
11770 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11771 special action is required for coherence between the lanes of a single
11772 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11773 wavefronts executing in the same work-group as they may be executing on SIMDs
11774 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11775 required for coherence between wavefronts executing in different work-groups
11776 as they may be executing on different WGPs.
11777 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
11778 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11779 operations are used in a restricted way so do not impact the memory model. See
11780 :ref:`amdgpu-amdhsa-memory-spaces`.
11781 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11782 the same SA. Therefore, no special action is required for coherence between
11783 the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11784 required for coherence between wavefronts executing in different work-groups
11785 as they may be executing on different SAs that access different L1s.
11786 * The L1 caches have independent quadrants to service disjoint ranges of virtual addresses.
11788 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11789 vector and scalar memory operations performed by different wavefronts, whether
11790 executing in the same or different work-groups (which may be executing on
11791 different CUs accessing different L0s), can be reordered relative to each
11792 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11793 synchronization between vector memory operations of different wavefronts. It
11794 ensures a previous vector memory operation has completed before executing a
11795 subsequent vector memory or LDS operation and so can be used to meet the
11796 requirements of acquire, release and sequential consistency.
11797 * The L1 caches use an L2 cache shared by all SAs on the same agent.
11798 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
11800 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11801 quadrant has a separate request queue per L2 channel. Therefore, the vector
11802 and scalar memory operations performed by wavefronts executing in different
11803 work-groups (which may be executing on different SAs) of an agent can be
11804 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11805 required to ensure synchronization between vector memory operations of
11806 different SAs. It ensures a previous vector memory operation has completed
11807 before executing a subsequent vector memory operation and so can be used to meet the
11808 requirements of acquire, release and sequential consistency.
11809 * The L2 cache can be kept coherent with other agents on some targets, or ranges
11810 of virtual addresses can be set up to bypass it to ensure system coherence.
11811 * On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
11812 The MALL cache is fully coherent with GPU memory and has no impact on system
11813 coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
11815 Scalar memory operations are only used to access memory that is proven to not
11816 change during the execution of the kernel dispatch. This includes constant
11817 address space and global address space for program scope ``const`` variables.
11818 Therefore, the kernel machine code does not have to maintain the scalar cache to
11819 ensure it is coherent with the vector caches. The scalar and vector caches are
11820 invalidated between kernel dispatches by CP since constant address space data
11821 may change between kernel dispatch executions. See
11822 :ref:`amdgpu-amdhsa-memory-spaces`.
11824 The one exception is if scalar writes are used to spill SGPR registers. In this
11825 case the AMDGPU backend ensures the memory location used to spill is never
11826 accessed by vector memory operations at the same time. If scalar writes are used
11827 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11828 return since the locations may be used for vector memory instructions by a
11829 future wavefront that uses the same scratch area, or a function call that
11830 creates a frame at the same address, respectively. There is no need for a
11831 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
11833 For kernarg backing memory:
11835 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11836 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11837 needing to invalidate the L2 cache.
11838 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11839 so the L2 cache will be coherent with the CPU and other agents.
11841 Scratch backing memory (which is used for the private address space) is accessed
11842 with MTYPE NC (non-coherent). Since the private address space is only accessed
11843 by a single thread, and is always write-before-read, there is never a need to
11844 invalidate these entries from the L0 or L1 caches.
11846 Wavefronts are executed in native mode with in-order reporting of loads and
11847 sample instructions. In this mode vmcnt reports completion of load, atomic with
11848 return and sample instructions in order, and the vscnt reports the completion of
11849 store and atomic without return in order. See ``MEM_ORDERED`` field in
11850 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
11852 Wavefronts can be executed in WGP or CU wavefront execution mode:
11854 * In WGP wavefront execution mode the wavefronts of a work-group are executed
11855 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
11856 CU L0 caches is required for work-group synchronization. Also accesses to L1
11857 at work-group scope need to be explicitly ordered as the accesses from
11858 different CUs are not ordered.
11859 * In CU wavefront execution mode the wavefronts of a work-group are executed on
the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
the work-group use the same L0, which in turn ensures L1 accesses are ordered,
so explicit management of the caches is not required for work-group
synchronization.
11865 See ``WGP_MODE`` field in
11866 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
11867 :ref:`amdgpu-target-features`.
11869 The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11870 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
11872 .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11873 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11875 ============ ============ ============== ========== ================================
11876 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
11877 Ordering Sync Scope Address GFX10-GFX11
11879 ============ ============ ============== ========== ================================
**Non-Atomic**
------------------------------------------------------------------------------------
11882 load *none* *none* - global - !volatile & !nontemporal
11884 - private 1. buffer/global/flat_load
11886 - !volatile & nontemporal
11888 1. buffer/global/flat_load
11891 - If GFX10, omit dlc=1.
11895 1. buffer/global/flat_load
11898 2. s_waitcnt vmcnt(0)
11900 - Must happen before
11901 any following volatile
11912 load *none* *none* - local 1. ds_load
11913 store *none* *none* - global - !volatile & !nontemporal
11915 - private 1. buffer/global/flat_store
11917 - !volatile & nontemporal
11919 1. buffer/global/flat_store
11922 - If GFX10, omit dlc=1.
11926 1. buffer/global/flat_store
11929 - If GFX10, omit dlc=1.
11931 2. s_waitcnt vscnt(0)
11933 - Must happen before
11934 any following volatile
11945 store *none* *none* - local 1. ds_store
11946 **Unordered Atomic**
11947 ------------------------------------------------------------------------------------
11948 load atomic unordered *any* *any* *Same as non-atomic*.
11949 store atomic unordered *any* *any* *Same as non-atomic*.
11950 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
11951 **Monotonic Atomic**
11952 ------------------------------------------------------------------------------------
11953 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
11954 - wavefront - generic
11955 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
11958 - If CU wavefront execution
11961 load atomic monotonic - singlethread - local 1. ds_load
11964 load atomic monotonic - agent - global 1. buffer/global/flat_load
11965 - system - generic glc=1 dlc=1
11967 - If GFX11, omit dlc=1.
11969 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
11970 - wavefront - generic
11974 store atomic monotonic - singlethread - local 1. ds_store
11977 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
11978 - wavefront - generic
11982 atomicrmw monotonic - singlethread - local 1. ds_atomic
**Acquire Atomic**
------------------------------------------------------------------------------------
11987 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
11988 - wavefront - local
11990 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
11992 - If CU wavefront execution
11995 2. s_waitcnt vmcnt(0)
11997 - If CU wavefront execution
11999 - Must happen before
12000 the following buffer_gl0_inv
12001 and before any following
12009 - If CU wavefront execution
12016 load atomic acquire - workgroup - local 1. ds_load
12017 2. s_waitcnt lgkmcnt(0)
12020 - Must happen before
12021 the following buffer_gl0_inv
12022 and before any following
12023 global/generic load/load
12029 older than the local load
12035 - If CU wavefront execution
12043 load atomic acquire - workgroup - generic 1. flat_load glc=1
12045 - If CU wavefront execution
12048 2. s_waitcnt lgkmcnt(0) &
12051 - If CU wavefront execution
12052 mode, omit vmcnt(0).
12055 - Must happen before
12057 buffer_gl0_inv and any
12058 following global/generic
12065 older than a local load
12071 - If CU wavefront execution
12078 load atomic acquire - agent - global 1. buffer/global_load
12079 - system glc=1 dlc=1
12081 - If GFX11, omit dlc=1.
12083 2. s_waitcnt vmcnt(0)
12085 - Must happen before
12090 before invalidating
12096 - Must happen before
12106 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
12108 - If GFX11, omit dlc=1.
12110 2. s_waitcnt vmcnt(0) &
12115 - Must happen before
12118 - Ensures the flat_load
12120 before invalidating
12126 - Must happen before
12136 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
12137 - wavefront - local
12139 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
12140 2. s_waitcnt vm/vscnt(0)
12142 - If CU wavefront execution
12144 - Use vmcnt(0) if atomic with
12145 return and vscnt(0) if
12146 atomic with no-return.
12147 - Must happen before
12148 the following buffer_gl0_inv
12149 and before any following
12157 - If CU wavefront execution
12164 atomicrmw acquire - workgroup - local 1. ds_atomic
12165 2. s_waitcnt lgkmcnt(0)
12168 - Must happen before
12174 older than the local
12186 atomicrmw acquire - workgroup - generic 1. flat_atomic
12187 2. s_waitcnt lgkmcnt(0) &
12190 - If CU wavefront execution
12191 mode, omit vm/vscnt(0).
12192 - If OpenCL, omit lgkmcnt(0).
12193 - Use vmcnt(0) if atomic with
12194 return and vscnt(0) if
12195 atomic with no-return.
12196 - Must happen before
12208 - If CU wavefront execution
12215 atomicrmw acquire - agent - global 1. buffer/global_atomic
12216 - system 2. s_waitcnt vm/vscnt(0)
12218 - Use vmcnt(0) if atomic with
12219 return and vscnt(0) if
12220 atomic with no-return.
12221 - Must happen before
12233 - Must happen before
12243 atomicrmw acquire - agent - generic 1. flat_atomic
12244 - system 2. s_waitcnt vm/vscnt(0) &
12249 - Use vmcnt(0) if atomic with
12250 return and vscnt(0) if
12251 atomic with no-return.
12252 - Must happen before
12264 - Must happen before
12274 fence acquire - singlethread *none* *none*
12276 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12277 vmcnt(0) & vscnt(0)
12279 - If CU wavefront execution
12280 mode, omit vmcnt(0) and
12289 vmcnt(0) and vscnt(0).
12290 - However, since LLVM
12295 always generate. If
12305 - Could be split into
12307 vmcnt(0), s_waitcnt
12308 vscnt(0) and s_waitcnt
12309 lgkmcnt(0) to allow
12311 independently moved
12314 - s_waitcnt vmcnt(0)
12317 global/generic load
12319 atomicrmw-with-return-value
12322 and memory ordering
12326 fence-paired-atomic).
12327 - s_waitcnt vscnt(0)
12331 atomicrmw-no-return-value
12334 and memory ordering
12338 fence-paired-atomic).
12339 - s_waitcnt lgkmcnt(0)
12346 and memory ordering
12350 fence-paired-atomic).
12351 - Must happen before
12355 fence-paired atomic
12357 before invalidating
12361 locations read must
12365 fence-paired-atomic.
12369 - If CU wavefront execution
12376 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
12377 - system vmcnt(0) & vscnt(0)
12386 vmcnt(0) and vscnt(0).
12387 - However, since LLVM
12395 - Could be split into
12397 vmcnt(0), s_waitcnt
12398 vscnt(0) and s_waitcnt
12399 lgkmcnt(0) to allow
12401 independently moved
12404 - s_waitcnt vmcnt(0)
12407 global/generic load
12409 atomicrmw-with-return-value
12412 and memory ordering
12416 fence-paired-atomic).
12417 - s_waitcnt vscnt(0)
12421 atomicrmw-no-return-value
12424 and memory ordering
12428 fence-paired-atomic).
12429 - s_waitcnt lgkmcnt(0)
12436 and memory ordering
12440 fence-paired-atomic).
12441 - Must happen before
12445 fence-paired atomic
12447 before invalidating
12451 locations read must
12455 fence-paired-atomic.
12460 - Must happen before any
12461 following global/generic
**Release Atomic**
------------------------------------------------------------------------------------
12472 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
12473 - wavefront - local
12475 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12476 - generic vmcnt(0) & vscnt(0)
12478 - If CU wavefront execution
12479 mode, omit vmcnt(0) and
12483 - Could be split into
12485 vmcnt(0), s_waitcnt
12486 vscnt(0) and s_waitcnt
12487 lgkmcnt(0) to allow
12489 independently moved
12492 - s_waitcnt vmcnt(0)
12495 global/generic load/load
12497 atomicrmw-with-return-value.
12498 - s_waitcnt vscnt(0)
12504 atomicrmw-no-return-value.
12505 - s_waitcnt lgkmcnt(0)
12512 - Must happen before
12520 store that is being
12523 2. buffer/global/flat_store
12524 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12526 - If CU wavefront execution
12529 - Could be split into
12531 vmcnt(0) and s_waitcnt
12534 independently moved
12537 - s_waitcnt vmcnt(0)
12540 global/generic load/load
12542 atomicrmw-with-return-value.
12543 - s_waitcnt vscnt(0)
12547 store/store atomic/
12548 atomicrmw-no-return-value.
12549 - Must happen before
12557 store that is being
12561 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
12562 - system - generic vmcnt(0) & vscnt(0)
12568 - Could be split into
12570 vmcnt(0), s_waitcnt vscnt(0)
12572 lgkmcnt(0) to allow
12574 independently moved
12577 - s_waitcnt vmcnt(0)
12583 atomicrmw-with-return-value.
12584 - s_waitcnt vscnt(0)
12588 store/store atomic/
12589 atomicrmw-no-return-value.
12590 - s_waitcnt lgkmcnt(0)
12597 - Must happen before
12605 store that is being
12608 2. buffer/global/flat_store
12609 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
12610 - wavefront - local
12612 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12613 - generic vmcnt(0) & vscnt(0)
12615 - If CU wavefront execution
12616 mode, omit vmcnt(0) and
12618 - If OpenCL, omit lgkmcnt(0).
12619 - Could be split into
12621 vmcnt(0), s_waitcnt
12622 vscnt(0) and s_waitcnt
12623 lgkmcnt(0) to allow
12625 independently moved
12628 - s_waitcnt vmcnt(0)
12631 global/generic load/load
12633 atomicrmw-with-return-value.
12634 - s_waitcnt vscnt(0)
12640 atomicrmw-no-return-value.
12641 - s_waitcnt lgkmcnt(0)
12648 - Must happen before
12659 2. buffer/global/flat_atomic
12660 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12662 - If CU wavefront execution
12665 - Could be split into
12667 vmcnt(0) and s_waitcnt
12670 independently moved
12673 - s_waitcnt vmcnt(0)
12676 global/generic load/load
12678 atomicrmw-with-return-value.
12679 - s_waitcnt vscnt(0)
12683 store/store atomic/
12684 atomicrmw-no-return-value.
12685 - Must happen before
12693 store that is being
12697 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
12698 - system - generic vmcnt(0) & vscnt(0)
12702 - Could be split into
12704 vmcnt(0), s_waitcnt
12705 vscnt(0) and s_waitcnt
12706 lgkmcnt(0) to allow
12708 independently moved
12711 - s_waitcnt vmcnt(0)
12716 atomicrmw-with-return-value.
12717 - s_waitcnt vscnt(0)
12721 store/store atomic/
12722 atomicrmw-no-return-value.
12723 - s_waitcnt lgkmcnt(0)
12730 - Must happen before
12735 to global and local
12741 2. buffer/global/flat_atomic
12742 fence release - singlethread *none* *none*
12744 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12745 vmcnt(0) & vscnt(0)
12747 - If CU wavefront execution
12748 mode, omit vmcnt(0) and
12757 vmcnt(0) and vscnt(0).
12758 - However, since LLVM
12763 always generate. If
12773 - Could be split into
12775 vmcnt(0), s_waitcnt
12776 vscnt(0) and s_waitcnt
12777 lgkmcnt(0) to allow
12779 independently moved
12782 - s_waitcnt vmcnt(0)
12788 atomicrmw-with-return-value.
12789 - s_waitcnt vscnt(0)
12793 store/store atomic/
12794 atomicrmw-no-return-value.
12795 - s_waitcnt lgkmcnt(0)
12800 atomic/store atomic/
12802 - Must happen before
12803 any following store
12807 and memory ordering
12811 fence-paired-atomic).
12818 fence-paired-atomic.
12820 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
12821 - system vmcnt(0) & vscnt(0)
12830 vmcnt(0) and vscnt(0).
12831 - However, since LLVM
12836 always generate. If
12846 - Could be split into
12848 vmcnt(0), s_waitcnt
12849 vscnt(0) and s_waitcnt
12850 lgkmcnt(0) to allow
12852 independently moved
12855 - s_waitcnt vmcnt(0)
12860 atomicrmw-with-return-value.
12861 - s_waitcnt vscnt(0)
12865 store/store atomic/
12866 atomicrmw-no-return-value.
12867 - s_waitcnt lgkmcnt(0)
12874 - Must happen before
12875 any following store
12879 and memory ordering
12883 fence-paired-atomic).
12890 fence-paired-atomic.
12892 **Acquire-Release Atomic**
12893 ------------------------------------------------------------------------------------
12894 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
12895 - wavefront - local
12897 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12898 vmcnt(0) & vscnt(0)
12900 - If CU wavefront execution
12901 mode, omit vmcnt(0) and
12905 - Must happen after
12911 - Could be split into
12913 vmcnt(0), s_waitcnt
12914 vscnt(0), and s_waitcnt
12915 lgkmcnt(0) to allow
12917 independently moved
12920 - s_waitcnt vmcnt(0)
12923 global/generic load/load
12925 atomicrmw-with-return-value.
12926 - s_waitcnt vscnt(0)
12932 atomicrmw-no-return-value.
12933 - s_waitcnt lgkmcnt(0)
12940 - Must happen before
12951 2. buffer/global_atomic
12952 3. s_waitcnt vm/vscnt(0)
12954 - If CU wavefront execution
12956 - Use vmcnt(0) if atomic with
12957 return and vscnt(0) if
12958 atomic with no-return.
12959 - Must happen before
12971 - If CU wavefront execution
12978 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12980 - If CU wavefront execution
12983 - Could be split into
12985 vmcnt(0) and s_waitcnt
12988 independently moved
12991 - s_waitcnt vmcnt(0)
12994 global/generic load/load
12996 atomicrmw-with-return-value.
12997 - s_waitcnt vscnt(0)
13001 store/store atomic/
13002 atomicrmw-no-return-value.
13003 - Must happen before
13011 store that is being
13015 3. s_waitcnt lgkmcnt(0)
13018 - Must happen before
13024 older than the local load
13030 - If CU wavefront execution
13038 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
13039 vmcnt(0) & vscnt(0)
13041 - If CU wavefront execution
13042 mode, omit vmcnt(0) and
13044 - If OpenCL, omit lgkmcnt(0).
13045 - Could be split into
13047 vmcnt(0), s_waitcnt
13048 vscnt(0) and s_waitcnt
13049 lgkmcnt(0) to allow
13051 independently moved
13054 - s_waitcnt vmcnt(0)
13057 global/generic load/load
13059 atomicrmw-with-return-value.
13060 - s_waitcnt vscnt(0)
13066 atomicrmw-no-return-value.
13067 - s_waitcnt lgkmcnt(0)
13074 - Must happen before
13086 3. s_waitcnt lgkmcnt(0) &
13087 vmcnt(0) & vscnt(0)
13089 - If CU wavefront execution
13090 mode, omit vmcnt(0) and
13092 - If OpenCL, omit lgkmcnt(0).
13093 - Must happen before
13099 older than the load
13105 - If CU wavefront execution
13112 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
13113 - system vmcnt(0) & vscnt(0)
13117 - Could be split into
13119 vmcnt(0), s_waitcnt
13120 vscnt(0) and s_waitcnt
13121 lgkmcnt(0) to allow
13123 independently moved
13126 - s_waitcnt vmcnt(0)
13131 atomicrmw-with-return-value.
13132 - s_waitcnt vscnt(0)
13136 store/store atomic/
13137 atomicrmw-no-return-value.
13138 - s_waitcnt lgkmcnt(0)
13145 - Must happen before
13156 2. buffer/global_atomic
13157 3. s_waitcnt vm/vscnt(0)
13159 - Use vmcnt(0) if atomic with
13160 return and vscnt(0) if
13161 atomic with no-return.
13162 - Must happen before
13174 - Must happen before
13184 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
13185 - system vmcnt(0) & vscnt(0)
13189 - Could be split into
13191 vmcnt(0), s_waitcnt
13192 vscnt(0), and s_waitcnt
13193 lgkmcnt(0) to allow
13195 independently moved
13198 - s_waitcnt vmcnt(0)
13203 atomicrmw-with-return-value.
13204 - s_waitcnt vscnt(0)
13208 store/store atomic/
13209 atomicrmw-no-return-value.
13210 - s_waitcnt lgkmcnt(0)
13217 - Must happen before
13229 3. s_waitcnt vm/vscnt(0) &
13234 - Use vmcnt(0) if atomic with
13235 return and vscnt(0) if
13236 atomic with no-return.
13237 - Must happen before
13249 - Must happen before
13259 fence acq_rel - singlethread *none* *none*
13261 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
13262 vmcnt(0) & vscnt(0)
13264 - If CU wavefront execution
13265 mode, omit vmcnt(0) and
13274 vmcnt(0) and vscnt(0).
13284 - Could be split into
13286 vmcnt(0), s_waitcnt
13287 vscnt(0) and s_waitcnt
13288 lgkmcnt(0) to allow
13290 independently moved
13293 - s_waitcnt vmcnt(0)
13299 atomicrmw-with-return-value.
13300 - s_waitcnt vscnt(0)
13304 store/store atomic/
13305 atomicrmw-no-return-value.
13306 - s_waitcnt lgkmcnt(0)
13311 atomic/store atomic/
13313 - Must happen before
13332 and memory ordering
13336 acquire-fence-paired-atomic)
13349 local/generic store
13353 and memory ordering
13357 release-fence-paired-atomic).
13361 - Must happen before
13365 acquire-fence-paired
13366 atomic has completed
13367 before invalidating
13371 locations read must
13375 acquire-fence-paired-atomic.
13379 - If CU wavefront execution
13386 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
13387 - system vmcnt(0) & vscnt(0)
13396 vmcnt(0) and vscnt(0).
13397 - However, since LLVM
13405 - Could be split into
13407 vmcnt(0), s_waitcnt
13408 vscnt(0) and s_waitcnt
13409 lgkmcnt(0) to allow
13411 independently moved
13414 - s_waitcnt vmcnt(0)
13420 atomicrmw-with-return-value.
13421 - s_waitcnt vscnt(0)
13425 store/store atomic/
13426 atomicrmw-no-return-value.
13427 - s_waitcnt lgkmcnt(0)
13434 - Must happen before
13439 global/local/generic
13444 and memory ordering
13448 acquire-fence-paired-atomic)
13450 before invalidating
13460 global/local/generic
13465 and memory ordering
13469 release-fence-paired-atomic).
13477 - Must happen before
13491 **Sequential Consistent Atomic**
13492 ------------------------------------------------------------------------------------
13493 load atomic seq_cst - singlethread - global *Same as corresponding
13494 - wavefront - local load atomic acquire,
13495 - generic except must generate
13496 all instructions even
13498 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
13499 - generic vmcnt(0) & vscnt(0)
13501 - If CU wavefront execution
13502 mode, omit vmcnt(0) and
13504 - Could be split into
13506 vmcnt(0), s_waitcnt
13507 vscnt(0), and s_waitcnt
13508 lgkmcnt(0) to allow
13510 independently moved
13513 - s_waitcnt lgkmcnt(0) must
13520 ordering of seq_cst
13526 lgkmcnt(0) and so do
13529 - s_waitcnt vmcnt(0)
13532 global/generic load
13534 atomicrmw-with-return-value
13536 ordering of seq_cst
13545 - s_waitcnt vscnt(0)
13548 global/generic store
13550 atomicrmw-no-return-value
13552 ordering of seq_cst
13564 consistent global/local
13565 memory instructions
13571 prevents reordering
13574 seq_cst load. (Note
13580 followed by a store
13587 release followed by
13590 order. The s_waitcnt
13591 could be placed after
13592 seq_store or before
13595 make the s_waitcnt be
13596 as late as possible
13602 instructions same as
13605 except must generate
13606 all instructions even
13608 load atomic seq_cst - workgroup - local
13610 1. s_waitcnt vmcnt(0) & vscnt(0)
13612 - If CU wavefront execution
13614 - Could be split into
13616 vmcnt(0) and s_waitcnt
13619 independently moved
13622 - s_waitcnt vmcnt(0)
13625 global/generic load
13627 atomicrmw-with-return-value
13629 ordering of seq_cst
13638 - s_waitcnt vscnt(0)
13641 global/generic store
13643 atomicrmw-no-return-value
13645 ordering of seq_cst
13658 memory instructions
13664 prevents reordering
13667 seq_cst load. (Note
13673 followed by a store
13680 release followed by
13683 order. The s_waitcnt
13684 could be placed after
13685 seq_store or before
13688 make the s_waitcnt be
13689 as late as possible
13695 instructions same as
13698 except must generate
13699 all instructions even
13702 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
13703 - system - generic vmcnt(0) & vscnt(0)
13705 - Could be split into
13707 vmcnt(0), s_waitcnt
13708 vscnt(0) and s_waitcnt
13709 lgkmcnt(0) to allow
13711 independently moved
13714 - s_waitcnt lgkmcnt(0)
13721 ordering of seq_cst
13727 lgkmcnt(0) and so do
13730 - s_waitcnt vmcnt(0)
13733 global/generic load
13735 atomicrmw-with-return-value
13737 ordering of seq_cst
13746 - s_waitcnt vscnt(0)
13749 global/generic store
13751 atomicrmw-no-return-value
13753 ordering of seq_cst
13766 memory instructions
13772 prevents reordering
13775 seq_cst load. (Note
13781 followed by a store
13788 release followed by
13791 order. The s_waitcnt
13792 could be placed after
13793 seq_store or before
13796 make the s_waitcnt be
13797 as late as possible
13803 instructions same as
13806 except must generate
13807 all instructions even
13809 store atomic seq_cst - singlethread - global *Same as corresponding
13810 - wavefront - local store atomic release,
13811 - workgroup - generic except must generate
13812 - agent all instructions even
13813 - system for OpenCL.*
13814 atomicrmw seq_cst - singlethread - global *Same as corresponding
13815 - wavefront - local atomicrmw acq_rel,
13816 - workgroup - generic except must generate
13817 - agent all instructions even
13818 - system for OpenCL.*
13819 fence seq_cst - singlethread *none* *Same as corresponding
13820 - wavefront fence acq_rel,
13821 - workgroup except must generate
13822 - agent all instructions even
13823 - system for OpenCL.*
13824 ============ ============ ============== ========== ================================
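As an illustration of how the table is read, the following hypothetical LLVM IR
fragment (a sketch, not compiler output) contains an agent scope acquire load and
an agent scope release store. Per the rows above, the acquire load is lowered to a
``glc=1`` (and, on GFX10, ``dlc=1``) load followed by a ``s_waitcnt vmcnt(0)`` and
cache invalidation, and the release store is preceded by a
``s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0)``.

.. code-block:: llvm

  define amdgpu_kernel void @acquire_release_example(ptr addrspace(1) %flag,
                                                     ptr addrspace(1) %data) {
  entry:
    ; Agent scope acquire load (see the "load atomic acquire - agent - global"
    ; row above).
    %f = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
    store i32 %f, ptr addrspace(1) %data, align 4
    ; Agent scope release store (see the "store atomic release - agent" row
    ; above).
    store atomic i32 1, ptr addrspace(1) %flag syncscope("agent") release, align 4
    ret void
  }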
13826 .. _amdgpu-amdhsa-trap-handler-abi:
Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13832 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13833 supports the ``s_trap`` instruction. For usage see:
13835 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13836 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13837 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13839 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13840 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13842 =================== =============== =============== =======================================
13843 Usage Code Sequence Trap Handler Description
13845 =================== =============== =============== =======================================
13846 reserved ``s_trap 0x00`` Reserved by hardware.
13847 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap``
13848 ``queue_ptr`` intrinsic (not implemented).
13851 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13852 ``queue_ptr`` the trap instruction. The associated
13853 queue is signalled to put it into the
13854 error state. When the queue is put in
13855 the error state, the waves executing
dispatches on the queue will be terminated.
13858 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13859 as a no-operation. The trap handler
13860 is entered and immediately returns to
13861 continue execution of the wavefront.
13862 - If the debugger is enabled, causes
13863 the debug trap to be reported by the
13864 debugger and the wavefront is put in
13865 the halt state with the PC at the
13866 instruction. The debugger must
13867 increment the PC and resume the wave.
13868 reserved ``s_trap 0x04`` Reserved.
13869 reserved ``s_trap 0x05`` Reserved.
13870 reserved ``s_trap 0x06`` Reserved.
13871 reserved ``s_trap 0x07`` Reserved.
13872 reserved ``s_trap 0x08`` Reserved.
13873 reserved ``s_trap 0xfe`` Reserved.
13874 reserved ``s_trap 0xff`` Reserved.
13875 =================== =============== =============== =======================================
13879 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13880 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13882 =================== =============== =============== =======================================
13883 Usage Code Sequence Trap Handler Description
13885 =================== =============== =============== =======================================
13886 reserved ``s_trap 0x00`` Reserved by hardware.
13887 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for
13888 breakpoints. Causes wave to be halted
13889 with the PC at the trap instruction.
13890 The debugger is responsible to resume
13891 the wave, including the instruction
13892 that the breakpoint overwrote.
13893 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13894 ``queue_ptr`` the trap instruction. The associated
13895 queue is signalled to put it into the
13896 error state. When the queue is put in
13897 the error state, the waves executing
dispatches on the queue will be terminated.
13900 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13901 as a no-operation. The trap handler
13902 is entered and immediately returns to
13903 continue execution of the wavefront.
13904 - If the debugger is enabled, causes
13905 the debug trap to be reported by the
13906 debugger and the wavefront is put in
13907 the halt state with the PC at the
13908 instruction. The debugger must
13909 increment the PC and resume the wave.
13910 reserved ``s_trap 0x04`` Reserved.
13911 reserved ``s_trap 0x05`` Reserved.
13912 reserved ``s_trap 0x06`` Reserved.
13913 reserved ``s_trap 0x07`` Reserved.
13914 reserved ``s_trap 0x08`` Reserved.
13915 reserved ``s_trap 0xfe`` Reserved.
13916 reserved ``s_trap 0xff`` Reserved.
13917 =================== =============== =============== =======================================
13921 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13922 :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13924 =================== =============== ================ ================= =======================================
13925 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13926 =================== =============== ================ ================= =======================================
13927 reserved ``s_trap 0x00`` Reserved by hardware.
13928 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for
13929 breakpoints. Causes wave to be halted
13930 with the PC at the trap instruction.
13931 The debugger is responsible to resume
13932 the wave, including the instruction
13933 that the breakpoint overwrote.
13934 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at
13935 ``queue_ptr`` the trap instruction. The associated
13936 queue is signalled to put it into the
13937 error state. When the queue is put in
13938 the error state, the waves executing
dispatches on the queue will be terminated.
13941 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If debugger not enabled then behaves
13942 as a no-operation. The trap handler
13943 is entered and immediately returns to
13944 continue execution of the wavefront.
13945 - If the debugger is enabled, causes
13946 the debug trap to be reported by the
13947 debugger and the wavefront is put in
13948 the halt state with the PC at the
13949 instruction. The debugger must
13950 increment the PC and resume the wave.
13951 reserved ``s_trap 0x04`` Reserved.
13952 reserved ``s_trap 0x05`` Reserved.
13953 reserved ``s_trap 0x06`` Reserved.
13954 reserved ``s_trap 0x07`` Reserved.
13955 reserved ``s_trap 0x08`` Reserved.
13956 reserved ``s_trap 0xfe`` Reserved.
13957 reserved ``s_trap 0xff`` Reserved.
13958 =================== =============== ================ ================= =======================================
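For reference, the following hypothetical IR fragment shows the generic LLVM
intrinsics that reach the trap handler. Per the tables above, ``llvm.trap`` is
lowered using ``s_trap 0x02`` and ``llvm.debugtrap`` using ``s_trap 0x03``.

.. code-block:: llvm

  declare void @llvm.trap()
  declare void @llvm.debugtrap()

  define amdgpu_kernel void @trap_example(i32 %v) {
  entry:
    %is_bad = icmp eq i32 %v, 0
    br i1 %is_bad, label %abort, label %cont

  abort:                                  ; reaches the handler via s_trap 0x02
    call void @llvm.trap()
    unreachable

  cont:                                   ; reaches the handler via s_trap 0x03
    call void @llvm.debugtrap()
    ret void
  }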
13960 .. _amdgpu-amdhsa-function-call-convention:
Function Call Convention
~~~~~~~~~~~~~~~~~~~~~~~~

This section is currently incomplete and has inaccuracies. It is a work in
progress that will be updated as information is determined.
13970 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13971 addresses. Unswizzled addresses are normal linear addresses.
13973 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.
See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.
13983 The following is not part of the AMDGPU kernel calling convention but describes
13984 how the AMDGPU implements function calls:
13986 1. Clang decides the kernarg layout to match the *HSA Programmer's Language
13989 - All structs are passed directly.
13990 - Lambda values are passed *TBA*.
13994 - Does this really follow HSA rules? Or are structs >16 bytes passed
13996 - What is ABI for lambda values?
13998 4. The kernel performs certain setup in its prolog, as described in
13999 :ref:`amdgpu-amdhsa-kernel-prolog`.
14001 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
14003 Non-Kernel Functions
14004 ++++++++++++++++++++
14006 This section describes the call convention ABI for functions other than the
14007 outer kernel function.
14009 If a kernel has function calls then scratch is always allocated and used for
14010 the call stack which grows from low address to high address using the swizzled
14011 scratch address space.
14013 On entry to a function:
14015 1. SGPR0-3 contain a V# with the following properties (see
14016 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
* Base address pointing to the beginning of the wavefront scratch backing
  memory.
14020 * Swizzled with dword element size and stride of wavefront size elements.
2. The FLAT_SCRATCH register pair is set up. See
14023 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
14024 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
14025 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
14026 4. The EXEC register is set to the lanes active on entry to the function.
14027 5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
14030 7. SGPR30-31 return address (RA). The code address that the function must
return to when it completes. The value is undefined if the function is *no
return*.
14033 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
14034 offset relative to the beginning of the wavefront scratch backing memory.
14036 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
manner.
14040 The unswizzled SP value can be converted into the swizzled SP value by:
14042 | swizzled SP = unswizzled SP / wavefront size
14044 This may be used to obtain the private address space address of stack
14045 objects and to convert this address to a flat address by adding the flat
14046 scratch aperture base address.
The swizzled SP value is always 4 byte aligned for the ``r600``
architecture and 16 byte aligned for the ``amdgcn`` architecture.
14053 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
14054 OpenCL language which has the largest base type defined as 16 bytes.
14056 On entry, the swizzled SP value is the address of the first function
14057 argument passed on the stack. Other stack passed arguments are positive
14058 offsets from the entry swizzled SP value.
14060 The function may use positive offsets beyond the last stack passed argument
14061 for stack allocated local variables and register spill slots. If necessary,
14062 the function may align these to greater alignment than 16 bytes. After these
14063 the function may dynamically allocate space for such things as runtime sized
14064 ``alloca`` local allocations.
14066 If the function calls another function, it will place any stack allocated
14067 arguments after the last local allocation and adjust SGPR32 to the address
14068 after the last local allocation.
14070 9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.
11. Use pass-by-reference (byref) instead of pass-by-value (byval) for struct
    arguments in the C ABI. The callee is responsible for allocating stack memory and
14075 copying the value of the struct if modified. Note that the backend still
14076 supports byval for struct arguments.
14078 On exit from a function:
14080 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
14081 described below. Any registers used are considered clobbered registers.
14082 2. The following registers are preserved and have the same value as on entry:
14087 * All SGPR registers except the clobbered registers of SGPR4-31.
14105 Except the argument registers, the VGPRs clobbered and the preserved
14106 registers are intermixed at regular intervals in order to keep a
14107 similar ratio independent of the number of allocated VGPRs.
14109 * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
14110 * Lanes of all VGPRs that are inactive at the call site.
14112 For the AMDGPU backend, an inter-procedural register allocation (IPRA)
optimization may mark some of the clobbered SGPR and VGPR registers as
preserved if it can be determined that the called function does not change
their value.
3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.
14125 - How are function results returned? The address of structured types is passed
14126 by reference, but what about other types?
14128 The function input arguments are made up of the formal arguments explicitly
14129 declared by the source language function plus the implicit input arguments used
14130 by the implementation.
14132 The source language input arguments are:
1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
14136 2. Followed by the function formal arguments in left to right source order.
14138 The source language result arguments are:
14140 1. The function result argument.
14142 The source language input or result struct type arguments that are less than or
14143 equal to 16 bytes, are decomposed recursively into their base type fields, and
14144 each field is passed as if a separate argument. For input arguments, if the
14145 called function requires the struct to be in memory, for example because its
14146 address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
14150 The source language input struct type arguments that are greater than 16 bytes,
14151 are passed by reference. The caller is responsible for allocating a stack
14152 location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible for performing the dereference when
14154 accessing the input argument. Clang terms this *by-value struct*.
14156 A source language result struct type argument that is greater than 16 bytes, is
14157 returned by reference. The caller is responsible for allocating a stack location
14158 to hold the result value and passes the address as the last input argument
14159 (before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference when
14161 storing the result value. Clang terms this *structured return (sret)*.
14163 *TODO: correct the ``sret`` definition.*
14167 Is this definition correct? Or is ``sret`` only used if passing in registers, and
14168 pass as non-decomposed struct as stack argument? Or something else? Is the
14169 memory location in the caller stack frame, or a stack memory argument and so
14170 no address is passed as the caller can directly write to the argument stack
14171 location? But then the stack location is still live after return. If an
14172 argument stack location is it the first stack argument or the last one?
14174 Lambda argument types are treated as struct types with an implementation defined
14179 Need to specify the ABI for lambda types for AMDGPU.
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
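For example, the following hypothetical non-kernel function takes one ``inreg``
argument, which is passed in an SGPR, and one ordinary argument, which is passed
in a VGPR:

.. code-block:: llvm

  ; Hypothetical function: %scale is marked inreg and is passed in an SGPR;
  ; %value is passed in a VGPR.
  define i32 @scale_value(i32 inreg %scale, i32 %value) {
  entry:
    %r = mul i32 %value, %scale
    ret i32 %r
  }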
14185 The AMDGPU backend walks the function call graph from the leaves to determine
14186 which implicit input arguments are used, propagating to each caller of the
14187 function. The used implicit arguments are appended to the function arguments
14188 after the source language arguments in the following order:
14192 Is recursion or external functions supported?
14194 1. Work-Item ID (1 VGPR)
The X, Y and Z work-item IDs are packed into a single VGPR with the following
layout (see the decoding sketch after this list). Only fields actually used by
the function are set. The other bits are undefined.
14200 The values come from the initial kernel execution state. See
14201 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14203 .. table:: Work-item implicit argument layout
14204 :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
14206 ======= ======= ==============
14207 Bits Size Field Name
14208 ======= ======= ==============
14209 9:0 10 bits X Work-Item ID
14210 19:10 10 bits Y Work-Item ID
14211 29:20 10 bits Z Work-Item ID
14212 31:30 2 bits Unused
14213 ======= ======= ==============
14215 2. Dispatch Ptr (2 SGPRs)
14217 The value comes from the initial kernel execution state. See
14218 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14220 3. Queue Ptr (2 SGPRs)
14222 The value comes from the initial kernel execution state. See
14223 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14225 4. Kernarg Segment Ptr (2 SGPRs)
14227 The value comes from the initial kernel execution state. See
14228 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14230 5. Dispatch id (2 SGPRs)
14232 The value comes from the initial kernel execution state. See
14233 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14235 6. Work-Group ID X (1 SGPR)
14237 The value comes from the initial kernel execution state. See
14238 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14240 7. Work-Group ID Y (1 SGPR)
14242 The value comes from the initial kernel execution state. See
14243 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14245 8. Work-Group ID Z (1 SGPR)
14247 The value comes from the initial kernel execution state. See
14248 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14250 9. Implicit Argument Ptr (2 SGPRs)
14252 The value is computed by adding an offset to Kernarg Segment Ptr to get the
14253 global address space pointer to the first kernarg implicit argument.
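The packed work-item ID argument described in item 1 above can be decoded with
simple shift and mask operations. The following is a hypothetical sketch (not
code the backend itself emits) using the layout in
:ref:`amdgpu-amdhsa-workitem-implicit-argument-layout-table`:

.. code-block:: llvm

  ; Hypothetical helper that extracts the Y work-item ID (bits 19:10) from the
  ; packed work-item ID value.
  define i32 @unpack_workitem_id_y(i32 %packed) {
  entry:
    %shifted = lshr i32 %packed, 10   ; move the Y field to bit 0
    %y = and i32 %shifted, 1023       ; mask to the 10-bit field
    ret i32 %y
  }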
14255 The input and result arguments are assigned in order in the following manner:
14259 There are likely some errors and omissions in the following description that
14264 Check the Clang source code to decipher how function arguments and return
14265 results are handled. Also see the AMDGPU specific values used.
14267 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
14270 If there are more arguments than will fit in these registers, the remaining
14271 arguments are allocated on the stack in order on naturally aligned
14276 How are overly aligned structures allocated on the stack?
14278 * SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
14281 If there are more arguments than will fit in these registers, the remaining
14282 arguments are allocated on the stack in order on naturally aligned
14285 Note that decomposed struct type arguments may have some fields passed in
14286 registers and some in memory.
14290 So, a struct which can pass some fields as decomposed register arguments, will
14291 pass the rest as decomposed stack elements? But an argument that will not start
14292 in registers will not be decomposed and will be passed as a non-decomposed
14295 The following is not part of the AMDGPU function calling convention but
14296 describes how the AMDGPU implements function calls:
14298 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
14299 unswizzled scratch address. It is only needed if runtime sized ``alloca``
14300 are used, or for the reasons defined in ``SIFrameLowering``.
14301 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
14302 to access the incoming stack arguments in the function. The BP is needed
14303 only when the function requires the runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
14307 4. No CFI is currently generated. See
14308 :ref:`amdgpu-dwarf-call-frame-information`.
14312 CFI will be generated that defines the CFA as the unswizzled address
14313 relative to the wave scratch base in the unswizzled private address space
14314 of the lowest address stack allocated local variable.
14316 ``DW_AT_frame_base`` will be defined as the swizzled address in the
14317 swizzled private address space by dividing the CFA by the wavefront size
14318 (since CFA is always at least dword aligned which matches the scratch
14319 swizzle element size).
14321 If no dynamic stack alignment was performed, the stack allocated arguments
14322 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
14323 local variables and register spill slots are accessed as positive offsets
14324 relative to ``DW_AT_frame_base``.
14326 5. Function argument passing is implemented by copying the input physical
14327 registers to virtual registers on entry. The register allocator can spill if
14328 necessary. These are copied back to physical registers at call sites. The
14329 net effect is that each function call can have these values in entirely
14330 distinct locations. The IPRA can help avoid shuffling argument registers.
14331 6. Call sites are implemented by setting up the arguments at positive offsets
14332 from SP. Then SP is incremented to account for the known frame size before
14333 the call and decremented after the call.
14337 The CFI will reflect the changed calculation needed to compute the CFA
14340 7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
14341 emergency spill slot. Buffer instructions are used for stack accesses and
14342 not the ``flat_scratch`` instruction.
14346 Explain when the emergency spill slot is used.
14350 Possible broken issues:
14352 - Stack arguments must be aligned to required alignment.
14353 - Stack is aligned to max(16, max formal argument alignment)
14354 - Direct argument < 64 bits should check register budget.
14355 - Register budget calculation should respect ``inreg`` for SGPR.
14356 - SGPR overflow is not handled.
14357 - struct with 1 member unpeeling is not checking size of member.
14358 - ``sret`` is after ``this`` pointer.
14359 - Caller is not implementing stack realignment: need an extra pointer.
14360 - Should say AMDGPU passes FP rather than SP.
14361 - Should CFI define CFA as address of locals or arguments. Difference is
14362 apparent when have implemented dynamic alignment.
14363 - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
14364 highest address of stack frame and use negative offset for locals. Would
14365 allow SP to be the same as FP and could support signal-handler-like as now
14366 have a real SP for the top of the stack.
14367 - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
AMDPAL
------

This section provides code conventions used when the target triple OS is
14374 ``amdpal`` (see :ref:`amdgpu-target-triples`).
14376 .. _amdgpu-amdpal-code-object-metadata-section:
14378 Code Object Metadata
14379 ~~~~~~~~~~~~~~~~~~~~
14383 The metadata is currently in development and is subject to major
14384 changes. Only the current version is supported. *When this document
14385 was generated the version was 2.6.*
14387 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
14388 record (see :ref:`amdgpu-note-records-v3-onwards`).
14390 The metadata is represented as Message Pack formatted binary data (see
14391 [MsgPack]_). The top level is a Message Pack map that includes the keys
14392 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
14393 and referenced tables.
14395 Additional information can be added to the maps. To avoid conflicts, any
14396 key names should be prefixed by "*vendor-name*." where ``vendor-name``
14397 can be the name of the vendor and specific vendor tool that generates the
14398 information. The prefix is abbreviated to simply "." when it appears
14399 within a map that has been added by the same *vendor-name*.
14401 .. table:: AMDPAL Code Object Metadata Map
14402 :name: amdgpu-amdpal-code-object-metadata-map-table
14404 =================== ============== ========= ======================================================================
14405 String Key Value Type Required? Description
14406 =================== ============== ========= ======================================================================
14407 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values
14408 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
14409 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See
14410 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
14411 definition of the keys included in that map.
14412 =================== ============== ========= ======================================================================
14416 .. table:: AMDPAL Code Object Pipeline Metadata Map
14417 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
14419 ====================================== ============== ========= ===================================================
14420 String Key Value Type Required? Description
14421 ====================================== ============== ========= ===================================================
14422 ".name" string Source name of the pipeline.
14423 ".type" string Pipeline type, e.g. VsPs. Values include:
14433 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
14434 2 integers 64 bits is the "stable" portion of the hash, used
14435 for e.g. shader replacement lookup. Upper 64 bits
14436 is the "unique" portion of the hash, used for
14437 e.g. pipeline cache lookup. The value is
14438 implementation defined, and can not be relied on
14439 between different builds of the compiler.
14440 ".shaders" map Per-API shader metadata. See
14441 :ref:`amdgpu-amdpal-code-object-shader-map-table`
14442 for the definition of the keys included in that
14444 ".hardware_stages" map Per-hardware stage metadata. See
14445 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
14446 for the definition of the keys included in that
14448 ".shader_functions" map Per-shader function metadata. See
14449 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
14450 for the definition of the keys included in that
14452 ".registers" map Required Hardware register configuration. See
14453 :ref:`amdgpu-amdpal-code-object-register-map-table`
14454 for the definition of the keys included in that
".user_data_limit" integer Number of user data entries accessed by this pipeline.
14458 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
14459 NoUserDataSpilling.
14460 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
14461 viewport array index feature. Pipelines which use
14462 this feature can render into all 16 viewports,
14463 whereas pipelines which do not use it are
14464 restricted to viewport #0.
14465 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
14466 handling data-passing between the ES and GS
14467 shader stages. This can be zero if the data is
14468 passed using off-chip buffers. This value should
14469 be used to program all user-SGPRs which have been
14470 marked with "UserDataMapping::EsGsLdsSize"
14471 (typically only the GS and VS HW stages will ever
14472 have a user-SGPR so marked).
14473 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
14474 (maximum number of threads in a subgroup).
14475 ".num_interpolants" integer Graphics only. Number of PS interpolants.
14476 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
14477 ".api" string Name of the client graphics API.
14478 ".api_create_info" binary Graphics API shader create info binary blob. Can
14479 be defined by the driver using the compiler if
14480 they want to be able to correlate API-specific
14481 information used during creation at a later time.
14482 ====================================== ============== ========= ===================================================
14486 .. table:: AMDPAL Code Object Shader Map
14487 :name: amdgpu-amdpal-code-object-shader-map-table
14490 +-------------+--------------+-------------------------------------------------------------------+
14491 |String Key |Value Type |Description |
14492 +=============+==============+===================================================================+
14493 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
14494 |- ".vertex" | |for the definition of the keys included in that map. |
14497 |- ".geometry"| | |
14499 +-------------+--------------+-------------------------------------------------------------------+
14503 .. table:: AMDPAL Code Object API Shader Metadata Map
14504 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
14506 ==================== ============== ========= =====================================================================
14507 String Key Value Type Required? Description
14508 ==================== ============== ========= =====================================================================
14509 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
14510 2 integers is implementation defined, and can not be relied on between
14511 different builds of the compiler.
14512 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
14523 ==================== ============== ========= =====================================================================
14527 .. table:: AMDPAL Code Object Hardware Stage Map
14528 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14530 +-------------+--------------+-----------------------------------------------------------------------+
14531 |String Key |Value Type |Description |
14532 +=============+==============+=======================================================================+
14533 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14534 |- ".hs" | |for the definition of the keys included in that map. |
14540 +-------------+--------------+-----------------------------------------------------------------------+
14544 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14545 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14547 ========================== ============== ========= ===============================================================
14548 String Key Value Type Required? Description
14549 ========================== ============== ========= ===============================================================
14550 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
14551 ".scratch_memory_size" integer Scratch memory size in bytes.
14552 ".lds_size" integer Local Data Share size in bytes.
14553 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
14554 ".vgpr_count" integer Number of VGPRs used.
14555 ".agpr_count" integer Number of AGPRs used.
14556 ".sgpr_count" integer Number of SGPRs used.
14557 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
14558 directive to instruct the compiler to limit the VGPR usage to
14559 be less than or equal to the specified value (only set if
14560 different from HW default).
".sgpr_limit" integer SGPR count upper limit (only set if different from HW default).
14563 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
14565 ".wavefront_size" integer Wavefront size (only set if different from HW default).
14566 ".uses_uavs" boolean The shader reads or writes UAVs.
14567 ".uses_rovs" boolean The shader reads or writes ROVs.
14568 ".writes_uavs" boolean The shader writes to one or more UAVs.
14569 ".writes_depth" boolean The shader writes out a depth value.
14570 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
14572 ".uses_prim_id" boolean The shader uses PrimID.
14573 ========================== ============== ========= ===============================================================
14577 .. table:: AMDPAL Code Object Shader Function Map
14578 :name: amdgpu-amdpal-code-object-shader-function-map-table
14580 =============== ============== ====================================================================
14581 String Key Value Type Description
14582 =============== ============== ====================================================================
14583 *symbol name* map *symbol name* is the ELF symbol name of the shader function code
14584 entry address. The value is the function's metadata. See
14585 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14586 =============== ============== ====================================================================
14590 .. table:: AMDPAL Code Object Shader Function Metadata Map
14591 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14593 ============================= ============== =================================================================
14594 String Key Value Type Description
14595 ============================= ============== =================================================================
14596 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
14597 2 integers is implementation defined, and can not be relied on between
14598 different builds of the compiler.
14599 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
14600 ".lds_size" integer Size in bytes of LDS memory.
14601 ".vgpr_count" integer Number of VGPRs used by the shader.
14602 ".sgpr_count" integer Number of SGPRs used by the shader.
14603 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
14604 ".shader_subtype" string Shader subtype/kind. Values include:
14608 ============================= ============== =================================================================
14612 .. table:: AMDPAL Code Object Register Map
14613 :name: amdgpu-amdpal-code-object-register-map-table
14615 ========================== ============== ====================================================================
14616 32-bit Integer Key Value Type Description
14617 ========================== ============== ====================================================================
14618 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14619 a GRBM register (i.e., driver accessible GPU register number, not
14620 shader GPR register number). The driver is required to program each
14621 specified register to the corresponding specified value when
14622 executing this pipeline. Typically, the ``reg offsets`` are the
14623 ``uint16_t`` offsets to each register as defined by the hardware
14624 chip headers. The register is set to the provided value. However, a
14625 ``reg offset`` that specifies a user data register (e.g.,
14626 COMPUTE_USER_DATA_0) needs special treatment. See
14627 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14629 ========================== ============== ====================================================================
14631 .. _amdgpu-amdpal-code-object-user-data-section:
14636 Each hardware stage has a set of 32-bit physical SPI *user data registers*
14637 (either 16 or 32 based on graphics IP and the stage) which can be
14638 written from a command buffer and then loaded into SGPRs when waves are
14639 launched via a subsequent dispatch or draw operation. This is the way
14640 most arguments are passed from the application/runtime to a hardware
14641 shader.
14643 PAL abstracts this functionality by exposing a set of 128 *user data
14644 entries* per pipeline that a client can use to pass arguments from a command
14645 buffer to one or more shaders in that pipeline. The ELF code object must
14646 specify a mapping from virtualized *user data entries* to physical *user
14647 data registers*, and PAL is responsible for implementing that mapping,
14648 including spilling overflow *user data entries* to memory if needed.
14650 Since the *user data registers* are GRBM-accessible SPI registers, this
14651 mapping is actually embedded in the ``.registers`` metadata entry. For
14652 most registers, the value in that map is a literal 32-bit value that
14653 should be written to the register by the driver. However, when the
14654 register is a *user data register* (any USER_DATA register e.g.,
14655 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14656 the driver to write either a *user data entry* value or one of several
14657 driver-internal values to the register. This encoding is described in
14658 the following table:
14662 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14663 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14664 always be programmed to the address of the GlobalTable, and *user data
14665 register* 1 must always be programmed to the address of the PerShaderTable.
14669 .. table:: AMDPAL User Data Mapping
14670 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14672 ========== ================= ===============================================================================
14673 Value Name Description
14674 ========== ================= ===============================================================================
14675 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14676 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
14677 always point to *user data register* 0).
14678 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
14679 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14680 for more detail (should always point to *user data register* 1).
14681 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
14682 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14683 more detail.
14684 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14685 reference the draw index in the vertex shader. Only supported by the first
14686 stage in a graphics pipeline.
14687 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
14688 a graphics pipeline.
14689 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a
14690 graphics pipeline.
14691 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14692 a buffer containing the grid dimensions for a Compute dispatch operation. The
14693 high half of the address is stored in the next sequential user-SGPR. Only
14694 supported by compute pipelines.
14695 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
14696 space used for the ES/GS pseudo-ring-buffer for passing data between shader
14697 stages.
14698 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
14699 pipeline instancing.
14700 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
14701 can only appear for one shader stage per pipeline.
14702 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
14703 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
14704 only appear for one shader stage per pipeline.
14705 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
14706 only appear for one shader stage per pipeline (PS). These replace color targets
14707 and are completely separate from any UAVs used by the shader. This is optional,
14708 and only used by the PS when UAV exports are used to replace color-target
14709 exports to optimize specific shaders.
14710 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
14711 some NGG pipelines to perform culling. This value contains the address of the
14712 first of two consecutive registers which provide the full GPU address.
14713 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
14714 ========== ================= ===============================================================================
14716 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14721 Low 32 bits of the GPU address for an optional buffer in the ``.data``
14722 section of the ELF. The high 32 bits of the address match the high 32 bits
14723 of the shader's program counter.
14725 The buffer can be anything the shader compiler needs it for, and
14726 allows each shader to have its own region of the ``.data`` section.
14727 Typically, this could be a table of buffer SRD's and the data pointed to
14728 by the buffer SRD's, but it could be a flat-address region of memory as
14729 well. Its layout and usage are defined by the shader compiler.
14731 Each shader's table in the ``.data`` section is referenced by the symbol
14732 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
14733 hardware shader stage the data is for. E.g.,
14734 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
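As a purely illustrative sketch (not a form the AMDGPU backend necessarily emits), such a region could be declared in hand-written assembly for the compute shader hardware stage as follows; the layout of the contents is entirely up to the shader compiler:

.. code-block:: nasm

  .section .data
  .globl _amdgpu_cs_shdr_intrl_data   // symbol name convention described above
  .p2align 4
  _amdgpu_cs_shdr_intrl_data:
    .long 0, 0, 0, 0                  // placeholder space, e.g. one 128-bit buffer SRD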
14736 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14741 It is possible for a hardware shader to need access to more *user data
14742 entries* than there are slots available in user data registers for one
14743 or more hardware shader stages. In that case, the PAL runtime expects
14744 the necessary *user data entries* to be spilled to GPU memory and uses
14745 one user data register to point to the spilled user data memory. The
14746 value of the *user data entry* must then represent the location where
14747 a shader expects to read the low 32-bits of the table's GPU virtual
14748 address. The *spill table* itself represents a set of 32-bit values
14749 managed by the PAL runtime in GPU-accessible memory that can be made
14750 indirectly accessible to a hardware shader.
14755 This section provides code conventions used when the target triple OS is
14756 empty (see :ref:`amdgpu-target-triples`).
14761 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime does
14762 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
14763 instructions are handled as follows:
14765 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14766 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14768 =============== =============== ===========================================
14769 Usage Code Sequence Description
14770 =============== =============== ===========================================
14771 llvm.trap s_endpgm Causes wavefront to be terminated.
14772 llvm.debugtrap *none* Compiler warning given that there is no
14773 trap handler installed.
14774 =============== =============== ===========================================
14784 When the language is OpenCL the following differences occur:
14786 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14787 2. The AMDGPU backend appends additional arguments to the kernel's explicit
14788 arguments for the AMDHSA OS (see
14789 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
14790 3. Additional metadata is generated
14791 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14793 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14794 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14796 ======== ==== ========= ===========================================
14797 Position Byte Byte Description
14798 Size Alignment
14799 ======== ==== ========= ===========================================
14800 1 8 8 OpenCL Global Offset X
14801 2 8 8 OpenCL Global Offset Y
14802 3 8 8 OpenCL Global Offset Z
14803 4 8 8 OpenCL address of printf buffer
14804 5 8 8 OpenCL address of virtual queue used by
14805 enqueue_kernel.
14806 6 8 8 OpenCL address of AqlWrap struct used by
14807 enqueue_kernel.
14808 7 8 8 Pointer argument used for Multi-grid
14809 synchronization.
14810 ======== ==== ========= ===========================================
14817 When the language is HCC the following differences occur:
14819 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14821 .. _amdgpu-assembler:
14826 The AMDGPU backend has an LLVM-MC based assembler which is currently in development.
14827 It supports AMDGCN GFX6-GFX11.
14829 This section describes general syntax for instructions and operands.
14834 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14836 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14837 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14839 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14840 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14842 The order of operands and modifiers is fixed.
14843 Most modifiers are optional and may be omitted.
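For illustration, here is one of the DS instructions from the examples later in this chapter, annotated to show the opcode, the comma-separated operands, and an optional modifier:

.. code-block:: nasm

  // <opcode>  <operands>  <modifiers>
  ds_add_u32   v2, v4      offset:16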
14845 Links to a detailed description of instruction syntax may be found in the following
14846 table. Note that features under development are not included
14847 in this description.
14849 ============= ============================================= =======================================
14850 Architecture Core ISA ISA Variants and Extensions
14851 ============= ============================================= =======================================
14852 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \-
14853 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \-
14854 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14856 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14858 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14860 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14862 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14864 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14866 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14868 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14870 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14872 :doc:`gfx941<AMDGPU/AMDGPUAsmGFX940>`
14874 :doc:`gfx942<AMDGPU/AMDGPUAsmGFX940>`
14876 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14878 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14880 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14882 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14884 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14886 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14888 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14890 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14892 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14894 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14896 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14898 RDNA 3 :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>` :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
14900 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
14902 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
14904 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
14905 ============= ============================================= =======================================
14907 For more information about instructions, their semantics and supported
14908 combinations of operands, refer to one of the instruction set architecture manuals
14909 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14910 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14911 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_,
14912 [AMD-GCN-GFX10-RDNA2]_ and [AMD-GCN-GFX11-RDNA3]_.
14917 A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14922 A detailed description of modifiers may be found
14923 :doc:`here<AMDGPUModifierSyntax>`.
14925 Instruction Examples
14926 ~~~~~~~~~~~~~~~~~~~~
14931 .. code-block:: nasm
14933 ds_add_u32 v2, v4 offset:16
14934 ds_write_src2_b64 v2 offset0:4 offset1:8
14935 ds_cmpst_f32 v2, v4, v6
14936 ds_min_rtn_f64 v[8:9], v2, v[4:5]
14938 For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.
14944 .. code-block:: nasm
14946 flat_load_dword v1, v[3:4]
14947 flat_store_dwordx3 v[3:4], v[5:7]
14948 flat_atomic_swap v1, v[3:4], v5 glc
14949 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14950 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14952 For a full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.
14958 .. code-block:: nasm
14960 buffer_load_dword v1, off, s[4:7], s1
14961 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14962 buffer_store_format_xy v[1:2], off, s[4:7], s1
14964 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14966 For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.
14972 .. code-block:: nasm
14974 s_load_dword s1, s[2:3], 0xfc
14975 s_load_dwordx8 s[8:15], s[2:3], s4
14976 s_load_dwordx16 s[88:103], s[2:3], s4
14980 For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.
14986 .. code-block:: nasm
14989 s_mov_b64 s[0:1], 0x80000000
14991 s_wqm_b64 s[2:3], s[4:5]
14992 s_bcnt0_i32_b64 s1, s[2:3]
14993 s_swappc_b64 s[2:3], s[4:5]
14994 s_cbranch_join s[4:5]
14996 For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.
15002 .. code-block:: nasm
15004 s_add_u32 s1, s2, s3
15005 s_and_b64 s[2:3], s[4:5], s[6:7]
15006 s_cselect_b32 s1, s2, s3
15007 s_andn2_b32 s2, s4, s6
15008 s_lshr_b64 s[2:3], s[4:5], s6
15009 s_ashr_i32 s2, s4, s6
15010 s_bfm_b64 s[2:3], s4, s6
15011 s_bfe_i64 s[2:3], s[4:5], s6
15012 s_cbranch_g_fork s[4:5], s[6:7]
15014 For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.
15020 .. code-block:: nasm
15022 s_cmp_eq_i32 s1, s2
15023 s_bitcmp1_b32 s1, s2
15024 s_bitcmp0_b64 s[2:3], s4
15027 For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.
15033 .. code-block:: nasm
15038 s_waitcnt 0 ; Wait for all counters to be 0
15039 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
15040 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
15044 s_sendmsg sendmsg(MSG_INTERRUPT)
15047 For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.
15050 Unless otherwise mentioned, little verification is performed on the operands
15051 of SOPP Instructions, so it is up to the programmer to be familiar with the
15052 range of acceptable values.
15057 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
15058 the assembler will automatically use the optimal encoding based on its operands.
15059 To force a specific encoding, one can add a suffix to the opcode of the instruction:
15061 * _e32 for 32-bit VOP1/VOP2/VOPC
15062 * _e64 for 64-bit VOP3
15063 * _dpp for VOP_DPP
15064 * _e64_dpp for VOP3 with DPP
15065 * _sdwa for VOP_SDWA
15067 VOP1/VOP2/VOP3/VOPC examples:
15069 .. code-block:: nasm
15072 v_mov_b32_e32 v1, v2
15074 v_cvt_f64_i32_e32 v[1:2], v2
15075 v_floor_f32_e32 v1, v2
15076 v_bfrev_b32_e32 v1, v2
15077 v_add_f32_e32 v1, v2, v3
15078 v_mul_i32_i24_e64 v1, v2, 3
15079 v_mul_i32_i24_e32 v1, -3, v3
15080 v_mul_i32_i24_e32 v1, -100, v3
15081 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
15082 v_max_f16_e32 v1, v2, v3
15084 VOP_DPP examples:
15086 .. code-block:: nasm
15088 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
15089 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15090 v_mov_b32 v0, v0 wave_shl:1
15091 v_mov_b32 v0, v0 row_mirror
15092 v_mov_b32 v0, v0 row_bcast:31
15093 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
15094 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15095 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15098 VOP3_DPP examples (Available on GFX11+):
15100 .. code-block:: nasm
15102 v_add_f32_e64_dpp v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
15103 v_sqrt_f32_e64_dpp v0, v1 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
15104 v_ldexp_f32 v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
15106 VOP_SDWA examples:
15108 .. code-block:: nasm
15110 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
15111 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
15112 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
15113 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
15114 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
15116 For a full list of supported instructions, refer to "Vector ALU instructions".
15118 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
15120 Code Object V2 Predefined Symbols
15121 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15124 Code object V2 generation is no longer supported by this version of LLVM.
15126 The AMDGPU assembler defines and updates some symbols automatically. These
15127 symbols do not affect code generation.
15129 .option.machine_version_major
15130 +++++++++++++++++++++++++++++
15132 Set to the GFX major generation number of the target being assembled for. For
15133 example, when assembling for a "GFX9" target this will be set to the integer
15134 value "9". The possible GFX major generation numbers are presented in
15135 :ref:`amdgpu-processors`.
15137 .option.machine_version_minor
15138 +++++++++++++++++++++++++++++
15140 Set to the GFX minor generation number of the target being assembled for. For
15141 example, when assembling for a "GFX810" target this will be set to the integer
15142 value "1". The possible GFX minor generation numbers are presented in
15143 :ref:`amdgpu-processors`.
15145 .option.machine_version_stepping
15146 ++++++++++++++++++++++++++++++++
15148 Set to the GFX stepping generation number of the target being assembled for.
15149 For example, when assembling for a "GFX704" target this will be set to the
15150 integer value "4". The possible GFX stepping generation numbers are presented
15151 in :ref:`amdgpu-processors`.
15153 .kernel.vgpr_count
15154 ++++++++++++++++++
15156 Set to zero each time a
15157 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15158 encountered. At each instruction, if the current value of this symbol is less
15159 than or equal to the maximum VGPR number explicitly referenced within that
15160 instruction then the symbol value is updated to equal that VGPR number plus
15161 one.
15163 .kernel.sgpr_count
15164 ++++++++++++++++++
15166 Set to zero each time a
15167 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15168 encountered. At each instruction, if the current value of this symbol is less
15169 than or equal to the maximum SGPR number explicitly referenced within that
15170 instruction then the symbol value is updated to equal that SGPR number plus
15171 one.
15173 .. _amdgpu-amdhsa-assembler-directives-v2:
15175 Code Object V2 Directives
15176 ~~~~~~~~~~~~~~~~~~~~~~~~~
15179 Code object V2 generation is no longer supported by this version of LLVM.
15181 The AMDGPU ABI defines auxiliary data in the output code object. In assembly
15182 source, one can specify it with assembler directives.
15184 .hsa_code_object_version major, minor
15185 +++++++++++++++++++++++++++++++++++++
15187 *major* and *minor* are integers that specify the version of the HSA code
15188 object that will be generated by the assembler.
15190 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
15191 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15194 *major*, *minor*, and *stepping* are all integers that describe the instruction
15195 set architecture (ISA) version of the assembly program.
15197 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
15198 "AMD" and *arch* should always be equal to "AMDGPU".
15200 By default, the assembler will derive the ISA version, *vendor*, and *arch*
15201 from the value of the -mcpu option that is passed to the assembler.
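For example, the version can be given explicitly, and the ISA can either be spelled out in full or left for the assembler to derive from ``-mcpu``. A minimal sketch (the GFX700 numbers shown are only illustrative):

.. code-block:: nasm

  .hsa_code_object_version 1,0
  .hsa_code_object_isa 7,0,0,"AMD","AMDGPU"  // or just .hsa_code_object_isa to derive everything from -mcpu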
15203 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
15205 .amdgpu_hsa_kernel (name)
15206 +++++++++++++++++++++++++
15208 This directive specifies that the symbol with the given name is a kernel entry
15209 point (label) and that the object should contain a corresponding symbol of type
15210 STT_AMDGPU_HSA_KERNEL.
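A minimal sketch of its use (``hello_world`` matches the kernel in the example source at the end of this section; the empty *.amd_kernel_code_t* block simply takes the default values described next):

.. code-block:: nasm

  .amdgpu_hsa_kernel hello_world  // declare hello_world as a kernel entry point
  hello_world:
  .amd_kernel_code_t              // all keys left at their defaults
  .end_amd_kernel_code_t
    s_endpgm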
15212 .amd_kernel_code_t
15213 ++++++++++++++++++
15215 This directive marks the beginning of a list of key / value pairs that are used
15216 to specify the amd_kernel_code_t object that will be emitted by the assembler.
15217 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
15218 amd_kernel_code_t values that are unspecified, a default value will be used. The
15219 default value for all keys is 0, with the following exceptions:
15221 - *amd_code_version_major* defaults to 1.
15222 - *amd_code_version_minor* defaults to 2.
15223 - *amd_machine_kind* defaults to 1.
15224 - *amd_machine_version_major*, *amd_machine_version_minor*, and
15225 *amd_machine_version_stepping* are derived from the value of the -mcpu option
15226 that is passed to the assembler.
15227 - *kernel_code_entry_byte_offset* defaults to 256.
15228 - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
15229 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
15230 Note that wavefront size is specified as a power of two, so a value of **n**
15231 means a size of 2^ **n**.
15232 - *call_convention* defaults to -1.
15233 - *kernarg_segment_alignment*, *group_segment_alignment*, and
15234 *private_segment_alignment* default to 4. Note that alignments are specified
15235 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
15236 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
15237 GFX90A onwards.
15238 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
15239 GFX10 onwards.
15240 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
15242 The *.amd_kernel_code_t* directive must be placed immediately after the
15243 function label and before any instructions.
15245 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
15246 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
15248 .. _amdgpu-amdhsa-assembler-example-v2:
15250 Code Object V2 Example Source Code
15251 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15254 Code object V2 generation is no longer supported by this version of LLVM.
15256 Here is an example of a minimal assembly source file, defining one HSA kernel:
15261 .hsa_code_object_version 1,0
15262 .hsa_code_object_isa
15267 .amdgpu_hsa_kernel hello_world
15272 enable_sgpr_kernarg_segment_ptr = 1
15274 compute_pgm_rsrc1_vgprs = 0
15275 compute_pgm_rsrc1_sgprs = 0
15276 compute_pgm_rsrc2_user_sgpr = 2
15277 compute_pgm_rsrc1_wgp_mode = 0
15278 compute_pgm_rsrc1_mem_ordered = 0
15279 compute_pgm_rsrc1_fwd_progress = 1
15280 .end_amd_kernel_code_t
15282 s_load_dwordx2 s[0:1], s[0:1] 0x0
15283 v_mov_b32 v0, 3.14159
15284 s_waitcnt lgkmcnt(0)
15285 v_mov_b32 v1, s0
15286 v_mov_b32 v2, s1
15287 flat_store_dword v[1:2], v0
15288 s_endpgm
15289 .Lfunc_end0:
15290 .size hello_world, .Lfunc_end0-hello_world
15292 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
15294 Code Object V3 and Above Predefined Symbols
15295 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15297 The AMDGPU assembler defines and updates some symbols automatically. These
15298 symbols do not affect code generation.
15300 .amdgcn.gfx_generation_number
15301 +++++++++++++++++++++++++++++
15303 Set to the GFX major generation number of the target being assembled for. For
15304 example, when assembling for a "GFX9" target this will be set to the integer
15305 value "9". The possible GFX major generation numbers are presented in
15306 :ref:`amdgpu-processors`.
15308 .amdgcn.gfx_generation_minor
15309 ++++++++++++++++++++++++++++
15311 Set to the GFX minor generation number of the target being assembled for. For
15312 example, when assembling for a "GFX810" target this will be set to the integer
15313 value "1". The possible GFX minor generation numbers are presented in
15314 :ref:`amdgpu-processors`.
15316 .amdgcn.gfx_generation_stepping
15317 +++++++++++++++++++++++++++++++
15319 Set to the GFX stepping generation number of the target being assembled for.
15320 For example, when assembling for a "GFX704" target this will be set to the
15321 integer value "4". The possible GFX stepping generation numbers are presented
15322 in :ref:`amdgpu-processors`.
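Because these are ordinary assembler symbols, they should also be usable in expressions, for example to assemble code conditionally. A minimal sketch, assuming the generic ``.if``/``.endif`` directives (the instruction chosen is only illustrative):

.. code-block:: nasm

  .if .amdgcn.gfx_generation_number >= 10
    s_nop 0  // assembled only when targeting GFX10 or newer
  .endif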
15324 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
15326 .amdgcn.next_free_vgpr
15327 ++++++++++++++++++++++
15329 Set to zero before assembly begins. At each instruction, if the current value
15330 of this symbol is less than or equal to the maximum VGPR number explicitly
15331 referenced within that instruction then the symbol value is updated to equal
15332 that VGPR number plus one.
15334 May be used to set the `.amdhsa_next_free_vgpr` directive in
15335 :ref:`amdhsa-kernel-directives-table`.
15337 May be set at any time, e.g. manually set to zero at the start of each kernel.
15339 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
15341 .amdgcn.next_free_sgpr
15342 ++++++++++++++++++++++
15344 Set to zero before assembly begins. At each instruction, if the current value
15345 of this symbol is less than or equal to the maximum SGPR number explicitly
15346 referenced within that instruction then the symbol value is updated to equal
15347 that SGPR number plus one.
15349 May be used to set the `.amdhsa_next_free_sgpr` directive in
15350 :ref:`amdhsa-kernel-directives-table`.
15352 May be set at any time, e.g. manually set to zero at the start of each kernel.
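For example, the tracked values can be fed directly into the corresponding kernel descriptor directives and then reset before the next kernel, as in the larger example at the end of this chapter (``my_kernel`` is a hypothetical kernel label):

.. code-block:: nasm

  .amdhsa_kernel my_kernel                         // kernel code and label assembled above
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr  // highest VGPR referenced, plus one
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr  // highest SGPR referenced, plus one
  .end_amdhsa_kernel

  .set .amdgcn.next_free_vgpr, 0                   // restart tracking for the next kernel
  .set .amdgcn.next_free_sgpr, 0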
15354 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
15356 Code Object V3 and Above Directives
15357 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15359 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
15360 architecture processors, and are not OS-specific. Directives which begin with
15361 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
15362 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
15363 :ref:`amdgpu-processors`.
15365 .. _amdgpu-assembler-directive-amdgcn-target:
15367 .amdgcn_target <target-triple> "-" <target-id>
15368 ++++++++++++++++++++++++++++++++++++++++++++++
15370 Optional directive which declares the ``<target-triple>-<target-id>`` supported
15371 by the containing assembler source file. Used by the assembler to validate
15372 command-line options such as ``-triple``, ``-mcpu``, and
15373 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
15374 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
15378 The target ID syntax used for code object V2 to V3 for this directive differs
15379 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
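For example (this line also appears in the example sources at the end of this chapter):

.. code-block:: nasm

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack"  // optional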
15381 .amdhsa_kernel <name>
15382 +++++++++++++++++++++
15384 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
15385 ``<name>.kd``, in the current location of the current section. Only valid when
15386 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
15387 instruction to execute, and does not need to be previously defined.
15389 Marks the beginning of a list of directives used to generate the bytes of a
15390 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
15391 Directives which may appear in this list are described in
15392 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
15393 be valid for the target being assembled for, and cannot be repeated. Directives
15394 support the range of values specified by the field they reference in
15395 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
15396 assumed to have its default value, unless it is marked as "Required", in which
15397 case it is an error to omit the directive. This list of directives is
15398 terminated by an ``.end_amdhsa_kernel`` directive.
15400 .. table:: AMDHSA Kernel Assembler Directives
15401 :name: amdhsa-kernel-directives-table
15403 ======================================================== =================== ============ ===================
15404 Directive Default Supported On Description
15405 ======================================================== =================== ============ ===================
15406 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX12 Controls GROUP_SEGMENT_FIXED_SIZE in
15407 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15408 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX12 Controls PRIVATE_SEGMENT_FIXED_SIZE in
15409 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15410 ``.amdhsa_kernarg_size`` 0 GFX6-GFX12 Controls KERNARG_SIZE in
15411 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15412 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX12 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
15413 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`
15414 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
15415 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15417 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_DISPATCH_PTR in
15418 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15419 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_QUEUE_PTR in
15420 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15421 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX12 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
15422 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15423 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX12 Controls ENABLE_SGPR_DISPATCH_ID in
15424 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15425 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
15426 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15428 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX12 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
15429 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15430 ``.amdhsa_wavefront_size32`` Target GFX10-GFX12 Controls ENABLE_WAVEFRONT_SIZE32 in
15431 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15434 ``.amdhsa_uses_dynamic_stack`` 0 GFX6-GFX12 Controls USES_DYNAMIC_STACK in
15435 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15436 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
15437 (except :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15439 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in
15440 GFX11-GFX12 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15441 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_X in
15442 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15443 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
15444 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15445 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
15446 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15447 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX12 Controls ENABLE_SGPR_WORKGROUP_INFO in
15448 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15449 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX12 Controls ENABLE_VGPR_WORKITEM_ID in
15450 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15451 Possible values are defined in
15452 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
15453 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX12 Maximum VGPR number explicitly referenced, plus one.
15454 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
15455 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15456 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX12 Maximum SGPR number explicitly referenced, plus one.
15457 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15458 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15459 ``.amdhsa_accum_offset`` Required GFX90A, Offset of the first AccVGPR in the unified register file.
15460 GFX940 Used to calculate ACCUM_OFFSET in
15461 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15462 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX12 Whether the kernel may use the special VCC SGPR.
15463 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15464 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15465 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
15466 (except scratch memory. Used to calculate
15467 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in
15468 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15469 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
15470 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15471 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15473 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX12 Controls FLOAT_ROUND_MODE_32 in
15474 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15475 Possible values are defined in
15476 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15477 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX12 Controls FLOAT_ROUND_MODE_16_64 in
15478 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15479 Possible values are defined in
15480 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15481 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX12 Controls FLOAT_DENORM_MODE_32 in
15482 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15483 Possible values are defined in
15484 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15485 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX12 Controls FLOAT_DENORM_MODE_16_64 in
15486 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15487 Possible values are defined in
15488 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15489 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in
15490 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15491 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in
15492 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15493 ``.amdhsa_round_robin_scheduling`` 0 GFX12 Controls ENABLE_WG_RR_EN in
15494 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15495 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX12 Controls FP16_OVFL in
15496 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15497 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in
15498 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15499 Specific GFX11-GFX12
15501 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX12 Controls ENABLE_WGP_MODE in
15502 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15505 ``.amdhsa_memory_ordered`` 1 GFX10-GFX12 Controls MEM_ORDERED in
15506 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15507 ``.amdhsa_forward_progress`` 0 GFX10-GFX12 Controls FWD_PROGRESS in
15508 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table`.
15509 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in
15510 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx12-table`.
15511 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
15512 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15513 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
15514 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15515 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
15516 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15517 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
15518 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15519 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
15520 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15521 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
15522 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15523 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX12 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
15524 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx12-table`.
15525 ``.amdhsa_user_sgpr_kernarg_preload_length`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_LENGTH in
15526 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15527 ``.amdhsa_user_sgpr_kernarg_preload_offset`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_OFFSET in
15528 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15529 ======================================================== =================== ============ ===================
15531 .amdgpu_metadata
15532 ++++++++++++++++
15534 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15535 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15537 The contents must be in the [YAML]_ markup format, with the same structure and
15538 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15539 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15540 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15542 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
15544 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15546 Code Object V3 and Above Example Source Code
15547 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15549 Here is an example of a minimal assembly source file, defining one HSA kernel:
15554 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15559 .type hello_world,@function
15561 s_load_dwordx2 s[0:1], s[0:1] 0x0
15562 v_mov_b32 v0, 3.14159
15563 s_waitcnt lgkmcnt(0)
15564 v_mov_b32 v1, s0
15565 v_mov_b32 v2, s1
15566 flat_store_dword v[1:2], v0
15567 s_endpgm
15568 .Lfunc_end0:
15569 .size hello_world, .Lfunc_end0-hello_world
15573 .amdhsa_kernel hello_world
15574 .amdhsa_user_sgpr_kernarg_segment_ptr 1
15575 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15576 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15585 - .name: hello_world
15586 .symbol: hello_world.kd
15587 .kernarg_segment_size: 48
15588 .group_segment_fixed_size: 0
15589 .private_segment_fixed_size: 0
15590 .kernarg_segment_align: 4
15591 .wavefront_size: 64
15594 .max_flat_workgroup_size: 256
15598 .value_kind: global_buffer
15599 .address_space: global
15600 .actual_access: write_only
15602 .end_amdgpu_metadata
15604 This kernel is equivalent to the following HIP program:
15609 __global__ void hello_world(float *p) {
15610   *p = 3.14159f;
15611 }
15613 If an assembly source file contains multiple kernels and/or functions, the
15614 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15615 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15616 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
15617 kernels, where ``func1`` is only called from ``kern1``, it is sufficient
15618 to group the function with the kernel that calls it and reset the symbols
15619 between the two connected components:
15624 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15626 // gpr tracking symbols are implicitly set to zero
15631 .type kern0,@function
15636 .size kern0, .Lkern0_end-kern0
15640 .amdhsa_kernel kern0
15642 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15643 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15646 // reset symbols to begin tracking usage in func1 and kern1
15647 .set .amdgcn.next_free_vgpr, 0
15648 .set .amdgcn.next_free_sgpr, 0
15654 .type func1,@function
15657 s_setpc_b64 s[30:31]
15659 .size func1, .Lfunc1_end-func1
15663 .type kern1,@function
15667 s_add_u32 s4, s4, func1@rel32@lo+4
15668 s_addc_u32 s5, s5, func1@rel32@hi+12
15669 s_swappc_b64 s[30:31], s[4:5]
15673 .size kern1, .Lkern1_end-kern1
15677 .amdhsa_kernel kern1
15679 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15680 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15683 These symbols cannot identify connected components, so they cannot automatically
15684 track the usage for each kernel. However, in some cases, careful organization of
15685 the kernels and functions in the source file means there is minimal additional
15686 effort required to accurately calculate GPR usage.
15688 Additional Documentation
15689 ========================
15691 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
15692 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15693 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15694 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15695 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15696 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15697 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15698 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15699 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15700 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
15701 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15702 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15703 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15704 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15705 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
15706 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
15707 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15708 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15709 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15710 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15711 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15712 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15713 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15714 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15715 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15716 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__