1 =============================
2 User Guide for AMDGPU Backend
3 =============================
14 AMDGPU/AMDGPUAsmGFX900
15 AMDGPU/AMDGPUAsmGFX904
16 AMDGPU/AMDGPUAsmGFX906
17 AMDGPU/AMDGPUAsmGFX908
18 AMDGPU/AMDGPUAsmGFX90a
19 AMDGPU/AMDGPUAsmGFX940
21 AMDGPU/AMDGPUAsmGFX1011
22 AMDGPU/AMDGPUAsmGFX1013
23 AMDGPU/AMDGPUAsmGFX1030
27 AMDGPUInstructionSyntax
28 AMDGPUInstructionNotation
29 AMDGPUDwarfExtensionsForHeterogeneousDebugging
30 AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
35 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
36 R600 family up until the current GCN families. It lives in the
37 ``llvm/lib/Target/AMDGPU`` directory.
42 .. _amdgpu-target-triples:
47 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
48 to specify the target triple:
50 .. table:: AMDGPU Architectures
51 :name: amdgpu-architecture-table
53 ============ ==============================================================
54 Architecture Description
55 ============ ==============================================================
56 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
57 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
58 ============ ==============================================================
60 .. table:: AMDGPU Vendors
61 :name: amdgpu-vendor-table
63 ============ ==============================================================
Vendor       Description
65 ============ ==============================================================
66 ``amd`` Can be used for all AMD GPU usage.
67 ``mesa3d`` Can be used if the OS is ``mesa3d``.
68 ============ ==============================================================
70 .. table:: AMDGPU Operating Systems
73 ============== ============================================================
OS             Description
75 ============== ============================================================
76 *<empty>* Defaults to the *unknown* OS.
77 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
80 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
81 loader on Linux. See *AMD ROCm Platform Release Notes*
82 [AMD-ROCm-Release-Notes]_ for supported hardware and
84 - AMD's PAL runtime using the *pal-amdhsa* loader on
87 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
88 runtime using the *pal-amdpal* loader on Windows and Linux
90 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
91 3D runtime using the *mesa-mesa3d* loader on Linux.
92 ============== ============================================================
94 .. table:: AMDGPU Environments
95 :name: amdgpu-environment-table
97 ============ ==============================================================
98 Environment Description
99 ============ ==============================================================
101 ============ ==============================================================
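
The tables above combine into a target triple such as ``amdgcn-amd-amdhsa``. As
a minimal illustrative sketch, the triple appears in LLVM IR as the module's
``target triple``:

.. code-block:: llvm

  ; Module targeting the amdhsa OS ABI on the amdgcn architecture.
  target triple = "amdgcn-amd-amdhsa"
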
103 .. _amdgpu-processors:
108 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
109 specify the AMDGPU processor together with optional target features. See
110 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
111 specific information.
113 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
115 * ``amdhsa`` is not supported in ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
118 .. table:: AMDGPU Processors
119 :name: amdgpu-processor-table
121 =========== =============== ============ ===== ================= =============== =============== ======================
122 Processor Alternative Target dGPU/ Target Target OS Support Example
123 Processor Triple APU Features Properties *(see* Products
124 Architecture Supported `amdgpu-os`_
133 =========== =============== ============ ===== ================= =============== =============== ======================
134 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
135 -----------------------------------------------------------------------------------------------------------------------
136 ``r600`` ``r600`` dGPU - Does not
141 ``r630`` ``r600`` dGPU - Does not
146 ``rs880`` ``r600`` dGPU - Does not
151 ``rv670`` ``r600`` dGPU - Does not
156 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
157 -----------------------------------------------------------------------------------------------------------------------
158 ``rv710`` ``r600`` dGPU - Does not
163 ``rv730`` ``r600`` dGPU - Does not
168 ``rv770`` ``r600`` dGPU - Does not
173 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
174 -----------------------------------------------------------------------------------------------------------------------
175 ``cedar`` ``r600`` dGPU - Does not
180 ``cypress`` ``r600`` dGPU - Does not
185 ``juniper`` ``r600`` dGPU - Does not
190 ``redwood`` ``r600`` dGPU - Does not
195 ``sumo`` ``r600`` dGPU - Does not
200 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
201 -----------------------------------------------------------------------------------------------------------------------
202 ``barts`` ``r600`` dGPU - Does not
207 ``caicos`` ``r600`` dGPU - Does not
212 ``cayman`` ``r600`` dGPU - Does not
217 ``turks`` ``r600`` dGPU - Does not
222 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
223 -----------------------------------------------------------------------------------------------------------------------
224 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
229 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
234 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
239 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
240 -----------------------------------------------------------------------------------------------------------------------
241 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
242 flat - *pal-amdhsa* - A6 Pro-7050B
243 scratch - *pal-amdpal* - A8-7100
251 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
252 flat - *pal-amdhsa* - FirePro W9100
253 scratch - *pal-amdpal* - FirePro S9150
255 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
256 flat - *pal-amdhsa* - Radeon R9 290x
257 scratch - *pal-amdpal* - Radeon R390
259 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
260 - ``mullins`` flat - *pal-amdpal* - E1-2200
268 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
269 flat - *pal-amdpal* - Radeon HD 8770
272 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
279 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
280 -----------------------------------------------------------------------------------------------------------------------
281 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
282 flat - *pal-amdhsa* - Pro A6-8500B
283 scratch - *pal-amdpal* - A8-8600P
299 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
300 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
301 scratch - *pal-amdpal* - Radeon R9 385
302 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
303 - *pal-amdhsa* - Radeon R9 Fury
304 - *pal-amdpal* - Radeon R9 FuryX
307 - Radeon Instinct MI8
308 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
309 flat - *pal-amdhsa* - Radeon RX 480
310 scratch - *pal-amdpal* - Radeon Instinct MI6
311 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
313 scratch - *pal-amdpal*
314 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
315 flat - *pal-amdhsa* - FirePro S7100
316 scratch - *pal-amdpal* - FirePro W7100
319 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
321 scratch - *pal-amdpal* .. TODO::
326 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
327 -----------------------------------------------------------------------------------------------------------------------
328 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
329 flat - *pal-amdhsa* Frontier Edition
330 scratch - *pal-amdpal* - Radeon RX Vega 56
334 - Radeon Instinct MI25
335 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
336 flat - *pal-amdhsa* - Ryzen 5 2400G
337 scratch - *pal-amdpal*
338 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
340 - *pal-amdpal* .. TODO::
345 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
346 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
347 scratch - *pal-amdpal* - Radeon VII
349 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
353 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
360 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
362 - xnack scratch .. TODO::
363 - kernarg preload - Packed
364 work-item Add product
367 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
368 flat - Ryzen 7 4700GE
369 scratch - Ryzen 5 4600G
381 ``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
383 - xnack scratch .. TODO::
384 - kernarg preload - Packed
385 work-item Add product
388 ``gfx941`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
390 - xnack scratch .. TODO::
391 - kernarg preload - Packed
392 work-item Add product
395 ``gfx942`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
397 - xnack scratch .. TODO::
398 - kernarg preload - Packed
399 work-item Add product
402 **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
403 -----------------------------------------------------------------------------------------------------------------------
404 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
405 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
406 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
408 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
409 - wavefrontsize64 - Absolute - *pal-amdhsa*
410 - xnack flat - *pal-amdpal*
412 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
413 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
414 - xnack scratch - *pal-amdpal*
415 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
416 - wavefrontsize64 flat - *pal-amdhsa*
417 - xnack scratch - *pal-amdpal* .. TODO::
422 **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
423 -----------------------------------------------------------------------------------------------------------------------
424 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
425 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
426 scratch - *pal-amdpal* - Radeon RX 6900 XT
427 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
428 - wavefrontsize64 flat - *pal-amdhsa*
429 scratch - *pal-amdpal*
430 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
431 - wavefrontsize64 flat - *pal-amdhsa*
432 scratch - *pal-amdpal* .. TODO::
437 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
438 - wavefrontsize64 flat
443 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
444 - wavefrontsize64 flat
450 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
451 - wavefrontsize64 flat
456 ``gfx1036`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
457 - wavefrontsize64 flat
463 **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
464 -----------------------------------------------------------------------------------------------------------------------
465 ``gfx1100`` ``amdgcn`` dGPU - cumode - Architected - *pal-amdpal* *TBA*
466 - wavefrontsize64 flat
469 work-item Add product
472 ``gfx1101`` ``amdgcn`` dGPU - cumode - Architected *TBA*
473 - wavefrontsize64 flat
476 work-item Add product
479 ``gfx1102`` ``amdgcn`` dGPU - cumode - Architected *TBA*
480 - wavefrontsize64 flat
483 work-item Add product
486 ``gfx1103`` ``amdgcn`` APU - cumode - Architected *TBA*
487 - wavefrontsize64 flat
490 work-item Add product
493 ``gfx1150`` ``amdgcn`` APU - cumode - Architected *TBA*
494 - wavefrontsize64 flat
497 work-item Add product
500 ``gfx1151`` ``amdgcn`` APU - cumode - Architected *TBA*
501 - wavefrontsize64 flat
504 work-item Add product
507 =========== =============== ============ ===== ================= =============== =============== ======================
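
As an illustrative sketch (the attribute name is standard LLVM, not specific to
this table), the processor selected with ``-mcpu`` or ``--offload-arch`` is
normally recorded in the generated LLVM IR as a ``"target-cpu"`` function
attribute:

.. code-block:: llvm

  ; Hypothetical kernel compiled for the gfx90a processor.
  define amdgpu_kernel void @empty_kernel() #0 {
    ret void
  }

  attributes #0 = { "target-cpu"="gfx90a" }
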
509 .. _amdgpu-target-features:
514 Target features control how code is generated to support certain
515 processor specific features. Not all target features are supported by
516 all processors. The runtime must ensure that the features supported by
517 the device used to execute the code match the features enabled when
518 generating the code. A mismatch of features may result in incorrect
519 execution, or a reduction in performance.
521 The target features supported by each processor are listed in
522 :ref:`amdgpu-processor-table`.
524 Target features are controlled by exactly one of the following Clang
527 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
529 The ``-mcpu`` and ``--offload-arch`` options can specify target features as
530 optional components of the target ID. If omitted, the target feature has the
531 ``any`` value. See :ref:`amdgpu-target-id`.
533 ``-m[no-]<target-feature>``
535 Target features not specified by the target ID are specified using a
536 separate option. These target features can have an ``on`` or ``off``
537 value. ``on`` is specified by omitting the ``no-`` prefix, and
538 ``off`` is specified by including the ``no-`` prefix. The default
539 if not specified is ``off``.
543 ``-mcpu=gfx908:xnack+``
544 Enable the ``xnack`` feature.
545 ``-mcpu=gfx908:xnack-``
546 Disable the ``xnack`` feature.
``-mcumode``
548 Enable the ``cumode`` feature.
``-mno-cumode``
550 Disable the ``cumode`` feature.
552 .. table:: AMDGPU Target Features
553 :name: amdgpu-target-features-table
555 =============== ============================ ==================================================
556 Target Feature Clang Option to Control Description
558 =============== ============================ ==================================================
559 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
560 when generating code for kernels. When disabled
561 native WGP wavefront execution mode is used,
562 when enabled CU wavefront execution mode is used
563 (see :ref:`amdgpu-amdhsa-memory-model`).
565 sramecc - ``-mcpu`` If specified, generate code that can only be
566 - ``--offload-arch`` loaded and executed in a process that has a
567 matching setting for SRAMECC.
569 If not specified for code object V2 to V3, generate
570 code that can be loaded and executed in a process
571 with SRAMECC enabled.
573 If not specified for code object V4 or above, generate
574 code that can be loaded and executed in a process
575 with either setting of SRAMECC.
577 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
578 work-groups are launched in threadgroup split mode.
579 When enabled the waves of a work-group may be
580 launched in different CUs.
582 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
583 generating code for kernels. When disabled
584 native wavefront size 32 is used, when enabled
585 wavefront size 64 is used.
587 xnack - ``-mcpu`` If specified, generate code that can only be
588 - ``--offload-arch`` loaded and executed in a process that has a
589 matching setting for XNACK replay.
591 If not specified for code object V2 to V3, generate
592 code that can be loaded and executed in a process
593 with XNACK replay enabled.
595 If not specified for code object V4 or above, generate
596 code that can be loaded and executed in a process
597 with either setting of XNACK replay.
599 XNACK replay can be used for demand paging and
600 page migration. If enabled in the device, then if
601 a page fault occurs the code may execute
602 incorrectly unless generated with XNACK replay
603 enabled, or generated for code object V4 or above without
604 specifying XNACK replay. Executing code that was
605 generated with XNACK replay enabled, or generated
606 for code object V4 or above without specifying XNACK replay,
607 on a device that does not have XNACK replay
608 enabled will execute correctly but may be less
609 performant than code generated for XNACK replay
611 =============== ============================ ==================================================
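
A hedged LLVM IR sketch of how these target features typically surface in the
generated code is shown below; the feature names and values are illustrative:

.. code-block:: llvm

  ; Hypothetical kernel built for gfx90a with xnack enabled and sramecc
  ; disabled, corresponding to the target ID gfx90a:sramecc-:xnack+.
  define amdgpu_kernel void @feature_example() #0 {
    ret void
  }

  attributes #0 = { "target-cpu"="gfx90a" "target-features"="+xnack,-sramecc" }
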
613 .. _amdgpu-target-id:
618 AMDGPU supports target IDs. See `Clang Offload Bundler
619 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
620 description. The AMDGPU target specific information is:
623 Is an AMDGPU processor or alternative processor name specified in
624 :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
625 the primary processor and alternative processor names. The canonical form
626 target ID only allows the primary processor name.
629 Is a target feature name specified in :ref:`amdgpu-target-features-table` that
630 is supported by the processor. The target features supported by each processor
631 are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
632 a target ID are marked as being controlled by ``-mcpu`` and
633 ``--offload-arch``. Each target feature must appear at most once in a target
634 ID. The non-canonical form target ID allows the target features to be
635 specified in any order. The canonical form target ID requires the target
636 features to be specified in alphabetic order.
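
For example, ``gfx90a:sramecc+:xnack-`` is a canonical form target ID, while
the equivalent ``gfx90a:xnack-:sramecc+`` is a non-canonical form because its
target features are not in alphabetic order.
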
638 .. _amdgpu-target-id-v2-v3:
640 Code Object V2 to V3 Target ID
641 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
643 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
644 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
645 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
646 directive and the bundle entry ID. In those cases it has the following BNF
651 <target-id> ::== <processor> ( "+" <target-feature> )*
653 Where a target feature is omitted if *Off* and present if *On* or *Any*.
657 The code object V2 to V3 cannot represent *Any* and treats it the same as *On*.
660 .. _amdgpu-embedding-bundled-objects:
662 Embedding Bundled Code Objects
663 ------------------------------
665 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
666 as described in `Clang Offload Bundler
667 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
671 The target ID syntax used for code object V2 to V3 for a bundle entry ID
672 differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
674 .. _amdgpu-address-spaces:
679 The AMDGPU architecture supports a number of memory address spaces. The address
680 space names use the OpenCL standard names, with some additions.
682 The AMDGPU address spaces correspond to target architecture specific LLVM
683 address space numbers used in LLVM IR.
685 The AMDGPU address spaces are described in
686 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
687 supported for the ``amdgcn`` target.
689 .. table:: AMDGPU Address Spaces
690 :name: amdgpu-address-spaces-table
692 ================================= =============== =========== ================ ======= ============================
693 .. 64-Bit Process Address Space
694 --------------------------------- --------------- ----------- ---------------- ------------------------------------
695 Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value
696 Space Number Name Name Size
697 ================================= =============== =========== ================ ======= ============================
698 Generic 0 flat flat 64 0x0000000000000000
699 Global 1 global global 64 0x0000000000000000
700 Region 2 N/A GDS 32 *not implemented for AMDHSA*
701 Local 3 group LDS 32 0xFFFFFFFF
702 Constant 4 constant *same as global* 64 0x0000000000000000
703 Private 5 private scratch 32 0xFFFFFFFF
704 Constant 32-bit 6 *TODO* 0x00000000
705 Buffer Fat Pointer (experimental) 7 *TODO*
706 Buffer Resource (experimental) 8 *TODO*
707 Streamout Registers 128 N/A GS_REGS
708 ================================= =============== =========== ================ ======= ============================
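
The LLVM IR address space numbers in the table above can be used directly in
IR. The following is a small illustrative sketch (the function and its stores
are hypothetical):

.. code-block:: llvm

  define void @address_space_example(ptr addrspace(1) %global,
                                     ptr addrspace(3) %local,
                                     ptr addrspace(5) %private) {
    ; Cast a global pointer to the generic (flat) address space 0.
    %flat = addrspacecast ptr addrspace(1) %global to ptr
    store i32 1, ptr %flat                  ; generic access
    store i32 2, ptr addrspace(3) %local    ; LDS access
    store i32 3, ptr addrspace(5) %private  ; scratch access
    ret void
  }
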
**Generic**
711 The generic address space is supported unless the *Target Properties* column
712 of :ref:`amdgpu-processor-table` specifies *Does not support generic address
space*.
715 The generic address space uses the hardware flat address support for two fixed
716 ranges of virtual addresses (the private and local apertures), that are
717 outside the range of addressable global memory, to map from a flat address to
718 a private or local address. This uses FLAT instructions that can take a flat
719 address and access global, private (scratch), and group (LDS) memory depending
720 on if the address is within one of the aperture ranges.
722 Flat access to scratch requires hardware aperture setup and setup in the
723 kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
724 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
725 setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
727 To convert between a private or group address space address (termed a segment
728 address) and a flat address the base address of the corresponding aperture
729 can be used. For GFX7-GFX8 these are available in the
730 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
731 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
732 GFX9-GFX11 the aperture base addresses are directly available as inline
733 constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
734 In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
735 aligned to 2^32, which makes it easier to convert from flat to segment or segment to flat addresses.
738 A global address space address has the same value when used as a flat address
739 so no conversion is needed.
741 **Global and Constant**
742 The global and constant address spaces both use global virtual addresses,
743 which are the same virtual address space used by the CPU. However, some
744 virtual addresses may only be accessible to the CPU, some only accessible
745 by the GPU, and some by both.
747 Using the constant address space indicates that the data will not change
748 during the execution of the kernel. This allows scalar read instructions to
749 be used. Because the constant address space can only be modified on the host
750 side, a generic pointer loaded from the constant address space can safely be
751 assumed to be a global pointer, since only device global memory is visible
752 to and managed by the host. The vector and scalar L1 caches are invalidated
753 of volatile data before each kernel dispatch execution to allow constant
754 memory to change values between kernel dispatches.
**Region**
757 The region address space uses the hardware Global Data Store (GDS). All
758 wavefronts executing on the same device will access the same memory for any
759 given region address. However, the same region address accessed by wavefronts
760 executing on different devices will access different memory. It is higher
761 performance than global memory. It is allocated by the runtime. The data
762 store (DS) instructions can be used to access it.
**Local**
765 The local address space uses the hardware Local Data Store (LDS) which is
766 automatically allocated when the hardware creates the wavefronts of a
767 work-group, and freed when all the wavefronts of a work-group have
768 terminated. All wavefronts belonging to the same work-group will access the
769 same memory for any given local address. However, the same local address
770 accessed by wavefronts belonging to different work-groups will access
771 different memory. It is higher performance than global memory. The data store
772 (DS) instructions can be used to access it.
**Private**
775 The private address space uses the hardware scratch memory support which
776 automatically allocates memory when it creates a wavefront and frees it when
777 a wavefronts terminates. The memory accessed by a lane of a wavefront for any
778 given private address will be different to the memory accessed by another lane
779 of the same or different wavefront for the same private address.
781 If a kernel dispatch uses scratch, then the hardware allocates memory from a
782 pool of backing memory allocated by the runtime for each wavefront. The lanes
783 of the wavefront access this using dword (4 byte) interleaving. The mapping
784 used from private address to backing memory address is:
786 ``wavefront-scratch-base +
787 ((private-address / 4) * wavefront-size * 4) +
788 (wavefront-lane-id * 4) + (private-address % 4)``
790 If each lane of a wavefront accesses the same private address, the
791 interleaving results in adjacent dwords being accessed and hence requires
792 fewer cache lines to be fetched.
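
For example (a purely illustrative calculation assuming a wavefront size of 64,
a wavefront scratch base of 0, and integer division), lane 5 accessing private
address 9 maps to backing memory address::

  0 + ((9 / 4) * 64 * 4) + (5 * 4) + (9 % 4) = 512 + 20 + 1 = 533
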
794 There are different ways that the wavefront scratch base address is
795 determined by a wavefront (see
796 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
798 Scratch memory can be accessed in an interleaved manner using buffer
799 instructions with the scratch buffer descriptor and per wavefront scratch
800 offset, by the scratch instructions, or by flat instructions. Multi-dword
801 access is not supported except by flat and scratch instructions in GFX9-GFX11.
804 Code that manipulates the stack values in other lanes of a wavefront,
805 such as by ``addrspacecast``-ing stack pointers to generic ones and taking offsets
806 that reach other lanes or by explicitly constructing the scratch buffer descriptor,
807 triggers undefined behavior when it modifies the scratch values of other lanes.
808 The compiler may assume that such modifications do not occur.
813 **Buffer Fat Pointer**
814 The buffer fat pointer is an experimental address space that is currently
815 unsupported in the backend. It exposes a non-integral pointer that is in
816 the future intended to support the modelling of 128-bit buffer descriptors
817 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
818 *pointer*), allowing normal LLVM load/store/atomic operations to be used to
819 model the buffer descriptors used heavily in graphics workloads targeting
822 The buffer descriptor used to construct a buffer fat pointer must be *raw*:
823 the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
824 must be off, and the extent must be measured in bytes. (On subtargets where
825 bounds checking may be disabled, buffer fat pointers may choose to enable
**Buffer Resource**
829 The buffer resource pointer, in address space 8, is the newer form
830 for representing buffer descriptors in AMDGPU IR, replacing their
831 previous representation as `<4 x i32>`. It is a non-integral pointer
832 that represents a 128-bit buffer descriptor resource (`V#`).
834 Since, in general, a buffer resource supports complex addressing modes that cannot
835 be easily represented in LLVM (such as implicit swizzled access to structured
836 buffers), it is **illegal** to perform non-trivial address computations, such as
837 ``getelementptr`` operations, on buffer resources. They may be passed to
838 AMDGPU buffer intrinsics, and they may be converted to and from ``i128``.
840 Casting a buffer resource to a buffer fat pointer is permitted and adds an offset
843 Buffer resources can be created from 64-bit pointers (which should be either
844 generic or global) using the `llvm.amdgcn.make.buffer.rsrc` intrinsic, which
845 takes the pointer, which becomes the base of the resource,
846 the 16-bit stride (and swizzle control) field stored in bits `63:48` of a `V#`,
847 the 32-bit NumRecords/extent field (bits `95:64`), and the 32-bit flags field
848 (bits `127:96`). The specific interpretation of these fields varies by the
849 target architecture and is detailed in the ISA descriptions.
851 **Streamout Registers**
852 Dedicated registers used by the GS NGG Streamout Instructions. The register
853 file is modelled as a memory in a distinct address space because it is indexed
854 by an address-like offset in place of named registers, and because register
855 accesses affect LGKMcnt. This is an internal address space used only by the
856 compiler. Do not use this address space for IR pointers.
858 .. _amdgpu-memory-scopes:
863 This section provides LLVM memory synchronization scopes supported by the AMDGPU
864 backend memory model when the target triple OS is ``amdhsa`` (see
865 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
867 The memory model supported is based on the HSA memory model [HSA]_ which is
868 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
869 relation is transitive over the synchronizes-with relation independent of scope
870 and synchronizes-with allows the memory scope instances to be inclusive (see
871 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
873 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
874 inclusion and requires the memory scopes to exactly match. However, this
875 is conservatively correct for OpenCL.
877 .. table:: AMDHSA LLVM Sync Scopes
878 :name: amdgpu-amdhsa-llvm-sync-scopes-table
880 ======================= ===================================================
881 LLVM Sync Scope Description
882 ======================= ===================================================
883 *none* The default: ``system``.
885 ``system`` Synchronizes with, and participates in modification
886 and seq_cst total orderings with, other operations
887 (except image operations) for all address spaces
888 (except private, or generic that accesses private)
889 provided the other operation's sync scope is:
892 - ``agent`` and executed by a thread on the same
894 - ``workgroup`` and executed by a thread in the
896 - ``wavefront`` and executed by a thread in the
899 ``agent`` Synchronizes with, and participates in modification
900 and seq_cst total orderings with, other operations
901 (except image operations) for all address spaces
902 (except private, or generic that accesses private)
903 provided the other operation's sync scope is:
905 - ``system`` or ``agent`` and executed by a thread
907 - ``workgroup`` and executed by a thread in the
909 - ``wavefront`` and executed by a thread in the
912 ``workgroup`` Synchronizes with, and participates in modification
913 and seq_cst total orderings with, other operations
914 (except image operations) for all address spaces
915 (except private, or generic that accesses private)
916 provided the other operation's sync scope is:
918 - ``system``, ``agent`` or ``workgroup`` and
919 executed by a thread in the same work-group.
920 - ``wavefront`` and executed by a thread in the
923 ``wavefront`` Synchronizes with, and participates in modification
924 and seq_cst total orderings with, other operations
925 (except image operations) for all address spaces
926 (except private, or generic that accesses private)
927 provided the other operation's sync scope is:
929 - ``system``, ``agent``, ``workgroup`` or
930 ``wavefront`` and executed by a thread in the
933 ``singlethread`` Only synchronizes with and participates in
934 modification and seq_cst total orderings with,
935 other operations (except image operations) running
936 in the same thread for all address spaces (for
937 example, in signal handlers).
939 ``one-as`` Same as ``system`` but only synchronizes with other
940 operations within the same address space.
942 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
943 operations within the same address space.
945 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
946 other operations within the same address space.
948 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
949 other operations within the same address space.
951 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
952 other operations within the same address space.
953 ======================= ===================================================
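
The following is a hedged LLVM IR sketch of how these sync scopes are spelled
on atomic operations (the function and the values used are hypothetical):

.. code-block:: llvm

  define void @sync_scope_example(ptr addrspace(1) %p) {
    ; Atomic add visible to all threads on the same agent (device).
    %old = atomicrmw add ptr addrspace(1) %p, i32 1 syncscope("agent") seq_cst
    ; Release fence restricted to the work-group and to a single address space.
    fence syncscope("workgroup-one-as") release
    ret void
  }
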
958 The AMDGPU backend implements the following LLVM IR intrinsics.
960 *This section is WIP.*
962 .. table:: AMDGPU LLVM IR Intrinsics
963 :name: amdgpu-llvm-ir-intrinsics-table
965 ============================================== ==========================================================
966 LLVM Intrinsic Description
967 ============================================== ==========================================================
968 llvm.amdgcn.sqrt Provides direct access to v_sqrt_f64, v_sqrt_f32 and v_sqrt_f16
969 (on targets with half support). Performs sqrt function.
971 llvm.amdgcn.log Provides direct access to v_log_f32 and v_log_f16
972 (on targets with half support). Performs log2 function.
974 llvm.amdgcn.exp2 Provides direct access to v_exp_f32 and v_exp_f16
975 (on targets with half support). Performs exp2 function.
977 :ref:`llvm.frexp <int_frexp>` Implemented for half, float and double.
979 :ref:`llvm.log2 <int_log2>` Implemented for float and half (and vectors of float or
980 half). Not implemented for double. Hardware provides
981 1ULP accuracy for float, and 0.51ULP for half. Float
982 instruction does not natively support denormal inputs.
985 :ref:`llvm.sqrt <int_sqrt>` Implemented for double, float and half (and vectors).
987 :ref:`llvm.log <int_log>` Implemented for float and half (and vectors).
989 :ref:`llvm.exp <int_exp>` Implemented for float and half (and vectors).
991 :ref:`llvm.log10 <int_log10>` Implemented for float and half (and vectors).
993 :ref:`llvm.exp2 <int_exp2>` Implemented for float and half (and vectors of float or
994 half). Not implemented for double. Hardware provides
995 1ULP accuracy for float, and 0.51ULP for half. Float
996 instruction does not natively support denormal inputs.
999 :ref:`llvm.stacksave.p5 <int_stacksave>` Implemented, must use the alloca address space.
1000 :ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space.
1002 :ref:`llvm.get.fpmode.i32 <int_get_fpmode>` The natural floating-point mode type is i32. This
1003 is implemented by extracting relevant bits out of the MODE
1004 register with s_getreg_b32. The first 10 bits are the
1005 core floating-point mode. Bits 12:18 are the exception
1006 mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
1007 relevant to floating-point instructions are 0s.
1009 :ref:`llvm.get.rounding<int_get_rounding>` AMDGPU supports two separately controllable rounding
1010 modes depending on the floating-point type. One
1011 controls float, and the other controls both double and
1012 half operations. If both modes are the same, returns
1013 one of the standard return values. If the modes are
1014 different, returns one of :ref:`12 extended values
1015 <amdgpu-rounding-mode-enumeration-values-table>`
1016 describing the two modes.
1018 To nearest, ties away from zero is not a supported
1019 mode. The raw rounding mode values in the MODE
1020 register do not exactly match the FLT_ROUNDS values,
1021 so a conversion is performed.
1023 llvm.amdgcn.wave.reduce.umin Performs an arithmetic unsigned min reduction on the unsigned values
1024 provided by each lane in the wavefront.
1025 The intrinsic takes a hint for the reduction strategy as its second operand:
1026 0: Target default preference,
1027 1: `Iterative strategy`, and
2: `DPP`.
1029 If the target does not support DPP operations (e.g. gfx6/7), the
1030 reduction will be performed using the default iterative strategy.
1031 Intrinsic is currently only implemented for i32.
1033 llvm.amdgcn.wave.reduce.umax Performs an arithmetic unsigned max reduction on the unsigned values
1034 provided by each lane in the wavefront.
1035 The intrinsic takes a hint for the reduction strategy as its second operand:
1036 0: Target default preference,
1037 1: `Iterative strategy`, and
2: `DPP`.
1039 If the target does not support DPP operations (e.g. gfx6/7), the
1040 reduction will be performed using the default iterative strategy.
1041 Intrinsic is currently only implemented for i32.
1043 llvm.amdgcn.udot2 Provides direct access to v_dot2_u32_u16 across targets which
1044 support such instructions. This performs unsigned dot product
1045 with two v2i16 operands, summed with the third i32 operand. The
1046 i1 fourth operand is used to clamp the output.
1048 llvm.amdgcn.udot4 Provides direct access to v_dot4_u32_u8 across targets which
1049 support such instructions. This performs unsigned dot product
1050 with two i32 operands (holding a vector of 4 8bit values), summed
1051 with the third i32 operand. The i1 fourth operand is used to clamp the output.
1054 llvm.amdgcn.udot8 Provides direct access to v_dot8_u32_u4 across targets which
1055 support such instructions. This performs unsigned dot product
1056 with two i32 operands (holding a vector of 8 4bit values), summed
1057 with the third i32 operand. The i1 fourth operand is used to clamp the output.
1060 llvm.amdgcn.sdot2 Provides direct access to v_dot2_i32_i16 across targets which
1061 support such instructions. This performs signed dot product
1062 with two v2i16 operands, summed with the third i32 operand. The
1063 i1 fourth operand is used to clamp the output.
1064 When applicable (e.g. no clamping), this is lowered into
1065 v_dot2c_i32_i16 for targets which support it.
1067 llvm.amdgcn.sdot4 Provides direct access to v_dot4_i32_i8 across targets which
1068 support such instructions. This performs signed dot product
1069 with two i32 operands (holding a vector of 4 8bit values), summed
1070 with the third i32 operand. The i1 fourth operand is used to clamp the output.
1072 When applicable (i.e. no clamping / operand modifiers), this is lowered
1073 into v_dot4c_i32_i8 for targets which support it.
1074 RDNA3 does not offer v_dot4_i32_i8, and rather offers
1075 v_dot4_i32_iu8 which has operands to hold the signedness of the
1076 vector operands. Thus, this intrinsic lowers to the signed version
1077 of this instruction for gfx11 targets.
1079 llvm.amdgcn.sdot8 Provides direct access to v_dot8_i32_i4 across targets which
1080 support such instructions. This performs signed dot product
1081 with two i32 operands (holding a vector of 8 4bit values), summed
1082 with the third i32 operand. The i1 fourth operand is used to clamp the output.
1084 When applicable (i.e. no clamping / operand modifiers), this is lowered
1085 into v_dot8c_i32_i4 for targets which support it.
1086 RDNA3 does not offer v_dot8_i32_i4, and rather offers
1087 v_dot8_i32_iu4 which has operands to hold the signedness of the
1088 vector operands. Thus, this intrinsic lowers to the signed version
1089 of this instruction for gfx11 targets.
1091 llvm.amdgcn.sudot4 Provides direct access to v_dot4_i32_iu8 on gfx11 targets. This performs
1092 dot product with two i32 operands (holding a vector of 4 8bit values), summed
1093 with the fifth i32 operand. The i1 sixth operand is used to clamp
1094 the output. The i1s preceding the vector operands decide the signedness.
1096 llvm.amdgcn.sudot8 Provides direct access to v_dot8_i32_iu4 on gfx11 targets. This performs
1097 dot product with two i32 operands (holding a vector of 8 4bit values), summed
1098 with the fifth i32 operand. The i1 sixth operand is used to clamp
1099 the output. The i1s preceding the vector operands decide the signedness.
1102 ============================================== ==========================================================
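
A hedged LLVM IR sketch of calling two of the intrinsics listed above follows;
the surrounding function is hypothetical:

.. code-block:: llvm

  declare float @llvm.amdgcn.sqrt.f32(float)
  declare i32 @llvm.amdgcn.udot4(i32, i32, i32, i1 immarg)

  define float @intrinsic_example(float %x, i32 %a, i32 %b, i32 %acc) {
    ; Square root of %x via the hardware instruction.
    %r = call float @llvm.amdgcn.sqrt.f32(float %x)
    ; Unsigned dot product of two v4i8 vectors packed in %a and %b, summed with
    ; %acc; the final i1 false disables clamping of the output.
    %d = call i32 @llvm.amdgcn.udot4(i32 %a, i32 %b, i32 %acc, i1 false)
    ret float %r
  }
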
1106 List AMDGPU intrinsics.
1111 The AMDGPU backend supports the following LLVM IR attributes.
1113 .. table:: AMDGPU LLVM IR Attributes
1114 :name: amdgpu-llvm-ir-attributes-table
1116 ======================================= ==========================================================
1117 LLVM Attribute Description
1118 ======================================= ==========================================================
1119 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
1120 will be specified when the kernel is dispatched. Generated
1121 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
1122 The IR implied default value is 1,1024. Clang may emit this attribute
1123 with more restrictive bounds depending on language defaults.
1124 If the actual block or workgroup size exceeds the limit at any point during
1125 the execution, the behavior is undefined. For example, even if there is
1126 only one active thread but the thread local id exceeds the limit, the
1127 behavior is undefined.
1129 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
1130 argument block size for the implicit arguments. This
1131 varies by OS and language (for OpenCL see
1132 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
1133 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
1134 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
1135 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
1136 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
1137 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
1138 execution unit. Generated by the ``amdgpu_waves_per_eu``
1139 CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
1140 and the backend may not be able to satisfy the request. If
1141 the specified range is incompatible with the function's
1142 "amdgpu-flat-work-group-size" value, the occupancy bounds
1143 implied by the workgroup size take precedence.
1145 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the
1146 mode register to be set on entry. Overrides the default for
1147 the calling convention.
1148 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of
1149 the mode register to be set on entry. Overrides the default
1150 for the calling convention.
1152 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
1153 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
1154 attribute, or reached through a call site marked with this attribute,
1155 the value returned by the intrinsic is undefined. The backend can
1156 generally infer this during code generation, so typically there is no
1157 benefit to frontends marking functions with this.
1159 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
1160 llvm.amdgcn.workitem.id.y intrinsic.
1162 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
1163 llvm.amdgcn.workitem.id.z intrinsic.
1165 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
1166 llvm.amdgcn.workgroup.id.x intrinsic.
1168 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
1169 llvm.amdgcn.workgroup.id.y intrinsic.
1171 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
1172 llvm.amdgcn.workgroup.id.z intrinsic.
1174 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
1175 llvm.amdgcn.dispatch.ptr intrinsic.
1177 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
1178 llvm.amdgcn.implicitarg.ptr intrinsic.
1180 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
1181 llvm.amdgcn.dispatch.id intrinsic.
1183 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
1184 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
1185 attributes, the queue pointer may be required in situations where the
1186 intrinsic call does not directly appear in the program. Some subtargets
1187 require the queue pointer to handle some addrspacecasts, as well
1188 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
1189 llvm.debugtrap intrinsics.
1191 "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1192 kernel argument that holds the pointer to the hostcall buffer. If this
1193 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1195 "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1196 kernel argument that holds the pointer to an initialized memory buffer
1197 that conforms to the requirements of the malloc/free device library V1
1198 version implementation. If this attribute is absent, then the
1199 amdgpu-no-implicitarg-ptr is also removed.
1201 "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1202 kernel argument that holds the multigrid synchronization pointer. If this
1203 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1205 "amdgpu-no-default-queue" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1206 kernel argument that holds the default queue pointer. If this
1207 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1209 "amdgpu-no-completion-action" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1210 kernel argument that holds the completion action pointer. If this
1211 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1213 "amdgpu-lds-size"="min[,max]" Min is the minimum number of bytes that will be allocated in the Local
1214 Data Store at address zero. Variables are allocated within this frame
1215 using absolute symbol metadata, primarily by the AMDGPULowerModuleLDS
1216 pass. Optional max is the maximum number of bytes that will be allocated.
1217 Note that min==max indicates that no further variables can be added to
1218 the frame. This is an internal detail of how LDS variables are lowered;
1219 language front ends should not set this attribute.
1221 ======================================= ==========================================================
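
A hedged LLVM IR sketch of attaching some of these attributes to a kernel is
shown below; the attribute values are illustrative only:

.. code-block:: llvm

  ; Kernel limited to flat work-group sizes of 1 to 256 work-items and asking
  ; for 2 to 4 waves per execution unit.
  define amdgpu_kernel void @attribute_example() #0 {
    ret void
  }

  attributes #0 = { "amdgpu-flat-work-group-size"="1,256" "amdgpu-waves-per-eu"="2,4" }
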
1226 The AMDGPU backend supports the following calling conventions:
1228 .. table:: AMDGPU Calling Conventions
1231 =============================== ==========================================================
1232 Calling Convention Description
1233 =============================== ==========================================================
1234 ``ccc`` The C calling convention. Used by default.
1235 See :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions`
1238 ``fastcc`` The fast calling convention. Mostly the same as the ``ccc``.
1240 ``coldcc`` The cold calling convention. Mostly the same as the ``ccc``.
1242 ``amdgpu_cs`` Used for Mesa/AMDPAL compute shaders.
1246 ``amdgpu_cs_chain`` Similar to ``amdgpu_cs``, with differences described below.
1248 Functions with this calling convention cannot be called directly. They must
1249 instead be launched via the ``llvm.amdgcn.cs.chain`` intrinsic.
1251 Arguments are passed in SGPRs, starting at s0, if they have the ``inreg``
1252 attribute, and in VGPRs otherwise, starting at v8. Using more SGPRs or VGPRs
1253 than available in the subtarget is not allowed. On subtargets that use
1254 a scratch buffer descriptor (as opposed to ``scratch_{load,store}_*`` instructions),
1255 the scratch buffer descriptor is passed in s[48:51]. This limits the
1256 SGPR / ``inreg`` arguments to the equivalent of 48 dwords; using more
1257 than that is not allowed.
1259 The return type must be void.
1260 Varargs, sret, byval, byref, inalloca, preallocated are not supported.
1262 Values in scalar registers as well as v0-v7 are not preserved. Values in
1263 VGPRs starting at v8 are not preserved for the active lanes, but must be
1264 saved by the callee for inactive lanes when using WWM.
1266 Wave scratch is "empty" at function boundaries. There is no stack pointer input
1267 or output value, but functions are free to use scratch starting from an initial
1268 stack pointer. Calls to ``amdgpu_gfx`` functions are allowed and behave like they
1269 do in ``amdgpu_cs`` functions.
1271 All counters (``lgkmcnt``, ``vmcnt``, ``storecnt``, etc.) are presumed in an
1272 unknown state at function entry.
1274 A function may have multiple exits (e.g. one chain exit and one plain ``ret void``
1275 for when the wave ends), but all ``llvm.amdgcn.cs.chain`` exits must be in
1276 uniform control flow.
1278 ``amdgpu_cs_chain_preserve`` Same as ``amdgpu_cs_chain``, but active lanes for VGPRs starting at v8 are preserved.
1279 Calls to ``amdgpu_gfx`` functions are not allowed, and any calls to ``llvm.amdgcn.cs.chain``
1280 must not pass more VGPR arguments than the caller's VGPR function parameters.
1282 ``amdgpu_es`` Used for AMDPAL shader stage before geometry shader if geometry is in
1283 use. So either the domain (= tessellation evaluation) shader if
1284 tessellation is in use, or otherwise the vertex shader.
1288 ``amdgpu_gfx`` Used for AMD graphics targets. Functions with this calling convention
1289 cannot be used as entry points.
1293 ``amdgpu_gs`` Used for Mesa/AMDPAL geometry shaders.
1297 ``amdgpu_hs`` Used for Mesa/AMDPAL hull shaders (= tessellation control shaders).
1301 ``amdgpu_kernel`` See :ref:`amdgpu-amdhsa-function-call-convention-kernel-functions`
1303 ``amdgpu_ls`` Used for AMDPAL vertex shader if tessellation is in use.
1307 ``amdgpu_ps`` Used for Mesa/AMDPAL pixel shaders.
1311 ``amdgpu_vs`` Used for Mesa/AMDPAL last shader stage before rasterization (vertex
1312 shader if tessellation and geometry are not in use, or otherwise
1313 copy shader if one is needed).
1317 =============================== ==========================================================
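
A hedged LLVM IR sketch showing an ``amdgpu_kernel`` entry point calling an
ordinary ``ccc`` function follows (both functions are hypothetical):

.. code-block:: llvm

  declare void @helper(float)

  define amdgpu_kernel void @cc_example(float %v) {
    ; Non-kernel callees default to the C calling convention (ccc).
    call void @helper(float %v)
    ret void
  }
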
1320 .. _amdgpu-elf-code-object:
1325 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1326 can be linked by ``lld`` to produce a standard ELF shared code object which can
1327 be loaded and executed on an AMDGPU target.
1329 .. _amdgpu-elf-header:
1334 The AMDGPU backend uses the following ELF header:
1336 .. table:: AMDGPU ELF Header
1337 :name: amdgpu-elf-header-table
1339 ========================== ===============================
Field                      Value
1341 ========================== ===============================
1342 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
1343 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
1344 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
1345 - ``ELFOSABI_AMDGPU_HSA``
1346 - ``ELFOSABI_AMDGPU_PAL``
1347 - ``ELFOSABI_AMDGPU_MESA3D``
1348 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1349 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1350 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1351 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1352 - ``ELFABIVERSION_AMDGPU_PAL``
1353 - ``ELFABIVERSION_AMDGPU_MESA3D``
1354 ``e_type`` - ``ET_REL``
1356 ``e_machine`` ``EM_AMDGPU``
1358 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1359 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1360 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1361 ========================== ===============================
1365 .. table:: AMDGPU ELF Header Enumeration Values
1366 :name: amdgpu-elf-header-enumeration-values-table
1368 =============================== =====
Name                            Value
1370 =============================== =====
1373 ``ELFOSABI_AMDGPU_HSA`` 64
1374 ``ELFOSABI_AMDGPU_PAL`` 65
1375 ``ELFOSABI_AMDGPU_MESA3D`` 66
1376 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1377 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1378 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1379 ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1380 ``ELFABIVERSION_AMDGPU_PAL`` 0
1381 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1382 =============================== =====
1384 ``e_ident[EI_CLASS]``
1387 * ``ELFCLASS32`` for ``r600`` architecture.
1389 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1390 process address space applications.
1392 ``e_ident[EI_DATA]``
1393 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1395 ``e_ident[EI_OSABI]``
1396 One of the following AMDGPU target architecture specific OS ABIs
1397 (see :ref:`amdgpu-os`):
1399 * ``ELFOSABI_NONE`` for *unknown* OS.
1401 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1403 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1405 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
1407 ``e_ident[EI_ABIVERSION]``
1408 The ABI version of the AMDGPU target architecture specific OS ABI to which the code object conforms.
1411 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1412 runtime ABI for code object V2. Specify using the Clang option
1413 ``-mcode-object-version=2``.
1415 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1416 runtime ABI for code object V3. Specify using the Clang option
1417 ``-mcode-object-version=3``.
1419 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1420 runtime ABI for code object V4. Specify using the Clang option
1421 ``-mcode-object-version=4``. This is the default code object
1422 version if not specified.
1424 * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1425 runtime ABI for code object V5. Specify using the Clang option
1426 ``-mcode-object-version=5``.
1428 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1431 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1435 Can be one of the following values:
1439 The type produced by the AMDGPU backend compiler as it is a relocatable code object.
1443 The type produced by the linker as it is a shared code object.
1445 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
1448 The value ``EM_AMDGPU`` is used for the machine for all processors supported
1449 by the ``r600`` and ``amdgcn`` architectures (see
1450 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1451 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1452 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1453 ``e_flags`` for code object V3 and above (see
1454 :ref:`amdgpu-elf-header-e_flags-table-v3` and
1455 :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1458 The entry point is 0 as the entry points for individual kernels must be
1459 selected in order to invoke them through AQL packets.
1462 The AMDGPU backend uses the following ELF header flags:
1464 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1465 :name: amdgpu-elf-header-e_flags-v2-table
1467 ===================================== ===== =============================
1468 Name Value Description
1469 ===================================== ===== =============================
1470 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
1472 enabled for all code
1473 contained in the code object.
1475 does not support the
1480 :ref:`amdgpu-target-features`.
1481 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
1482 handler is enabled for all
1483 code contained in the code
1484 object. If the processor
1485 does not support a trap
1486 handler then must be 0.
1488 :ref:`amdgpu-target-features`.
1489 ===================================== ===== =============================
1491 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1492 :name: amdgpu-elf-header-e_flags-table-v3
1494 ================================= ===== =============================
1495 Name Value Description
1496 ================================= ===== =============================
1497 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1499 ``EF_AMDGPU_MACH_xxx`` values
1501 :ref:`amdgpu-ef-amdgpu-mach-table`.
1502 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
1504 enabled for all code
1505 contained in the code object.
1507 does not support the
1512 :ref:`amdgpu-target-features`.
1513 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
1515 enabled for all code
1516 contained in the code object.
1518 does not support the
1523 :ref:`amdgpu-target-features`.
1524 ================================= ===== =============================
1526 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1527 :name: amdgpu-elf-header-e_flags-table-v4-onwards
1529 ============================================ ===== ===================================
1530 Name Value Description
1531 ============================================ ===== ===================================
1532 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1534 ``EF_AMDGPU_MACH_xxx`` values
1536 :ref:`amdgpu-ef-amdgpu-mach-table`.
1537 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1538 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1540 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1541 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1542 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1543 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
``EF_AMDGPU_FEATURE_SRAMECC_*_V4`` values.
1547 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1548 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1550 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1551 ============================================ ===== ===================================
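As an illustration only, the following sketch shows how a tool might decode the
processor selection and XNACK setting from the ``e_flags`` value of a code
object V4 (or later) ELF header using the masks and values above. The constant
definitions simply restate the table, and the ``e_flags`` value is assumed to
have already been read from the ELF header; the SRAMECC setting is decoded the
same way using ``EF_AMDGPU_FEATURE_SRAMECC_V4`` and its values.

.. code:: c

  #include <stdint.h>
  #include <stdio.h>

  /* Mask and value constants restated from the code object V4 tables above. */
  #define EF_AMDGPU_MACH                         0x0ff
  #define EF_AMDGPU_FEATURE_XNACK_V4             0x300
  #define EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4 0x000
  #define EF_AMDGPU_FEATURE_XNACK_ANY_V4         0x100
  #define EF_AMDGPU_FEATURE_XNACK_OFF_V4         0x200
  #define EF_AMDGPU_FEATURE_XNACK_ON_V4          0x300

  /* Decode the processor selection and XNACK setting from an e_flags value. */
  static void decode_e_flags(uint32_t e_flags) {
    uint32_t mach = e_flags & EF_AMDGPU_MACH; /* An EF_AMDGPU_MACH_xxx value. */
    printf("EF_AMDGPU_MACH: 0x%03x\n", (unsigned)mach);

    switch (e_flags & EF_AMDGPU_FEATURE_XNACK_V4) {
    case EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4: printf("xnack: unsupported\n"); break;
    case EF_AMDGPU_FEATURE_XNACK_ANY_V4:         printf("xnack: any\n"); break;
    case EF_AMDGPU_FEATURE_XNACK_OFF_V4:         printf("xnack: off\n"); break;
    case EF_AMDGPU_FEATURE_XNACK_ON_V4:          printf("xnack: on\n"); break;
    }
  }
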
1553 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1554 :name: amdgpu-ef-amdgpu-mach-table
1556 ==================================== ========== =============================
1557 Name Value Description (see
1558 :ref:`amdgpu-processor-table`)
1559 ==================================== ========== =============================
1560 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1561 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1562 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1563 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1564 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1565 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1566 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1567 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1568 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1569 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1570 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1571 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1572 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1573 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1574 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1575 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1576 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1577 *reserved* 0x011 - Reserved for ``r600``
1578 0x01f architecture processors.
1579 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1580 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1581 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1582 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1583 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1584 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1585 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1586 *reserved* 0x027 Reserved.
1587 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1588 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1589 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1590 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1591 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1592 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1593 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1594 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1595 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1596 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1597 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1598 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1599 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1600 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1601 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1602 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1603 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1604 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1605 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1606 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1607 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1608 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1609 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1610 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1611 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940``
1612 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100``
1613 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1614 ``EF_AMDGPU_MACH_AMDGCN_GFX1150`` 0x043 ``gfx1150``
1615 ``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103``
1616 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036``
1617 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101``
1618 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102``
1619 *reserved* 0x048 Reserved.
1620 *reserved* 0x049 Reserved.
1621 ``EF_AMDGPU_MACH_AMDGCN_GFX1151`` 0x04a ``gfx1151``
1622 ``EF_AMDGPU_MACH_AMDGCN_GFX941`` 0x04b ``gfx941``
1623 ``EF_AMDGPU_MACH_AMDGCN_GFX942`` 0x04c ``gfx942``
1624 ==================================== ========== =============================
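As a small illustrative sketch (not part of the specification), the value
ranges in the table above can be used to classify which architecture a code
object targets:

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Classify an EF_AMDGPU_MACH value using the ranges in the table above:
     0x001 through 0x01f are r600 processors (or reserved for them), and
     0x020 onwards are amdgcn processors. 0x000 means not specified. */
  static bool mach_is_r600(uint32_t mach) {
    return mach >= 0x001 && mach <= 0x01f;
  }

  static bool mach_is_amdgcn(uint32_t mach) {
    return mach >= 0x020;
  }
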
1629 An AMDGPU target ELF code object has the standard ELF sections which include:
1631 .. table:: AMDGPU ELF Sections
1632 :name: amdgpu-elf-sections-table
1634 ================== ================ =================================
1635 Name Type Attributes
1636 ================== ================ =================================
1637 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1638 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1639 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1640 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1641 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1642 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1643 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1644 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1645 ``.note`` ``SHT_NOTE`` *none*
1646 ``.rela``\ *name* ``SHT_RELA`` *none*
1647 ``.rela.dyn`` ``SHT_RELA`` *none*
1648 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1649 ``.shstrtab`` ``SHT_STRTAB`` *none*
1650 ``.strtab`` ``SHT_STRTAB`` *none*
1651 ``.symtab`` ``SHT_SYMTAB`` *none*
1652 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1653 ================== ================ =================================
These sections have their standard meanings (see [ELF]_) and are only generated
if needed.
``.debug_``\ *\**
The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
information on the DWARF produced by the AMDGPU backend.
1662 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1663 The standard sections used by a dynamic loader.
``.note``
See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
backend.
1669 ``.rela``\ *name*, ``.rela.dyn``
1670 For relocatable code objects, *name* is the name of the section that the
1671 relocation records apply. For example, ``.rela.text`` is the section name for
1672 relocation records associated with the ``.text`` section.
1674 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1675 records from each of the relocatable code object's ``.rela``\ *name* sections.
See :ref:`amdgpu-relocation-records` for the relocation records supported by
the AMDGPU backend.
``.text``
The executable machine code for the kernels and functions they call. Generated
as position independent code. See :ref:`amdgpu-code-conventions` for
information on conventions used in the ISA generation.
1685 .. _amdgpu-note-records:
1690 The AMDGPU backend code object contains ELF note records in the ``.note``
1691 section. The set of generated notes and their semantics depend on the code
1692 object version; see :ref:`amdgpu-note-records-v2` and
1693 :ref:`amdgpu-note-records-v3-onwards`.
1695 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1696 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1697 byte aligned. In addition, minimal zero-byte padding must be generated to
1698 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 8 byte
alignment.
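As a sketch of the padding rules above, the following helper computes the total
size a producer would emit for a note record, assuming the standard ELF note
layout of three 4-byte words (name size, description size, type) followed by
the name and the description:

.. code:: c

  #include <stddef.h>

  /* Round a size up to the next multiple of 4 bytes. */
  static size_t align4(size_t n) { return (n + 3) & ~(size_t)3; }

  /* Total size of an ELF note record with the required zero-byte padding:
     the three 4-byte words, the name (including its terminating NUL) padded
     so the desc field is 4 byte aligned, and the desc padded to a multiple
     of 4 bytes. */
  static size_t note_record_size(size_t name_size, size_t desc_size) {
    return 3 * 4 + align4(name_size) + align4(desc_size);
  }
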
1702 .. _amdgpu-note-records-v2:
1704 Code Object V2 Note Records
1705 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1708 Code object V2 generation is no longer supported by this version of LLVM.
The AMDGPU backend code object uses the following ELF note records in the
``.note`` section when compiling for code object V2.
1713 The note record vendor field is "AMD".
1715 Additional note records may be present, but any which are not documented here
1716 are deprecated and should not be used.
1718 .. table:: AMDGPU Code Object V2 ELF Note Records
1719 :name: amdgpu-elf-note-records-v2-table
1721 ===== ===================================== ======================================
1722 Name Type Description
1723 ===== ===================================== ======================================
1724 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
1725 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
1726 Finalizer and not the LLVM compiler.
1727 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
1728 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
1729 YAML [YAML]_ textual format.
1730 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
1731 ===== ===================================== ======================================
1735 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1736 :name: amdgpu-elf-note-record-enumeration-values-v2-table
1738 ===================================== =====
Name Value
===================================== =====
1741 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
1742 ``NT_AMD_HSA_HSAIL`` 2
1743 ``NT_AMD_HSA_ISA_VERSION`` 3
1745 ``NT_AMD_HSA_METADATA`` 10
1746 ``NT_AMD_HSA_ISA_NAME`` 11
1747 ===================================== =====
1749 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
Specifies the code object version number. The description field has the
following layout:
1755 struct amdgpu_hsa_note_code_object_version_s {
1756 uint32_t major_version;
uint32_t minor_version;
};
1760 The ``major_version`` has a value less than or equal to 2.
1762 ``NT_AMD_HSA_HSAIL``
1763 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1764 field has the following layout:
1768 struct amdgpu_hsa_note_hsail_s {
1769 uint32_t hsail_major_version;
1770 uint32_t hsail_minor_version;
1772 uint8_t machine_model;
uint8_t default_float_round;
};
1776 ``NT_AMD_HSA_ISA_VERSION``
1777 Specifies the target ISA version. The description field has the following layout:
1781 struct amdgpu_hsa_note_isa_s {
1782 uint16_t vendor_name_size;
1783 uint16_t architecture_name_size;
uint32_t major;
uint32_t minor;
uint32_t stepping;
char vendor_and_architecture_name[1];
};
1790 ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1791 vendor and architecture names respectively, including the NUL character.
``vendor_and_architecture_name`` contains the NUL terminated string for the
vendor, immediately followed by the NUL terminated string for the
architecture.
1797 This note record is used by the HSA runtime loader.
1799 Code object V2 only supports a limited number of processors and has fixed
1800 settings for target features. See
1801 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1802 processors and the corresponding target ID. In the table the note record ISA
1803 name is a concatenation of the vendor name, architecture name, major, minor,
1804 and stepping separated by a ":".
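For illustration, a minimal sketch of how the note record ISA name (for example
``AMD:AMDGPU:9:0:0``) could be assembled from the fields of the
``amdgpu_hsa_note_isa_s`` layout above; the function and parameter names are
illustrative only:

.. code:: c

  #include <stdint.h>
  #include <stdio.h>

  /* Build the code object V2 note record ISA name, for example
     "AMD:AMDGPU:9:0:0", by concatenating the vendor name, architecture name,
     major, minor and stepping separated by ':'. The vendor and architecture
     strings are the two NUL terminated strings held in
     vendor_and_architecture_name. */
  static int build_isa_name(char *buf, size_t buf_size,
                            const char *vendor, const char *architecture,
                            uint32_t major, uint32_t minor, uint32_t stepping) {
    return snprintf(buf, buf_size, "%s:%s:%u:%u:%u", vendor, architecture,
                    (unsigned)major, (unsigned)minor, (unsigned)stepping);
  }
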
1806 The target ID column shows the processor name and fixed target features used
1807 by the LLVM compiler. The LLVM compiler does not generate a
1808 ``NT_AMD_HSA_HSAIL`` note record.
1810 A code object generated by the Finalizer also uses code object V2 and always
1811 generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1812 ``sramecc`` target feature is as shown in
1813 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
bit.
1817 ``NT_AMD_HSA_ISA_NAME``
1818 Specifies the target ISA name as a non-NUL terminated string.
1820 This note record is not used by the HSA runtime loader.
1822 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1823 V2's limited support of processors and fixed settings for target features.
1825 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1826 from the string to the corresponding target ID. If the ``xnack`` target
1827 feature is supported and enabled, the string produced by the LLVM compiler
may have a ``+xnack`` appended. The Finalizer did not do the appending and
1829 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1831 ``NT_AMD_HSA_METADATA``
1832 Specifies extensible metadata associated with the code objects executed on HSA
1833 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1834 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
:ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
metadata string.
1838 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1839 :name: amdgpu-elf-note-record-supported_processors-v2-table
1841 ===================== ==========================
1842 Note Record ISA Name Target ID
1843 ===================== ==========================
1844 ``AMD:AMDGPU:6:0:0`` ``gfx600``
1845 ``AMD:AMDGPU:6:0:1`` ``gfx601``
1846 ``AMD:AMDGPU:6:0:2`` ``gfx602``
1847 ``AMD:AMDGPU:7:0:0`` ``gfx700``
1848 ``AMD:AMDGPU:7:0:1`` ``gfx701``
1849 ``AMD:AMDGPU:7:0:2`` ``gfx702``
1850 ``AMD:AMDGPU:7:0:3`` ``gfx703``
1851 ``AMD:AMDGPU:7:0:4`` ``gfx704``
1852 ``AMD:AMDGPU:7:0:5`` ``gfx705``
1853 ``AMD:AMDGPU:8:0:0`` ``gfx802``
1854 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
1855 ``AMD:AMDGPU:8:0:2`` ``gfx802``
1856 ``AMD:AMDGPU:8:0:3`` ``gfx803``
1857 ``AMD:AMDGPU:8:0:4`` ``gfx803``
1858 ``AMD:AMDGPU:8:0:5`` ``gfx805``
1859 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
1860 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
1861 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
1862 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
1863 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
1864 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
1865 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
1866 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
1867 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
1868 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1869 ===================== ==========================
1871 .. _amdgpu-note-records-v3-onwards:
1873 Code Object V3 and Above Note Records
1874 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1876 The AMDGPU backend code object uses the following ELF note record in the
1877 ``.note`` section when compiling for code object V3 and above.
1879 The note record vendor field is "AMDGPU".
1881 Additional note records may be present, but any which are not documented here
1882 are deprecated and should not be used.
1884 .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1885 :name: amdgpu-elf-note-records-table-v3-onwards
1887 ======== ============================== ======================================
1888 Name Type Description
1889 ======== ============================== ======================================
"AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
binary format.
1892 ======== ============================== ======================================
1896 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1897 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1899 ============================== =====
Name Value
============================== =====
1903 ``NT_AMDGPU_METADATA`` 32
1904 ============================== =====
1906 ``NT_AMDGPU_METADATA``
1907 Specifies extensible metadata associated with an AMDGPU code object. It is
1908 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1909 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1910 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
:ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for each
code object metadata version.
1919 Symbols include the following:
1921 .. table:: AMDGPU ELF Symbols
1922 :name: amdgpu-elf-symbols-table
1924 ===================== ================== ================ ==================
1925 Name Type Section Description
1926 ===================== ================== ================ ==================
1927 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1930 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1931 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1932 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
1933 ===================== ================== ================ ==================
1936 Global variables both used and defined by the compilation unit.
If the symbol is defined in the compilation unit then it is allocated in the
appropriate section according to whether it has initialized data or is readonly.
1941 If the symbol is external then its section is ``STN_UNDEF`` and the loader
1942 will resolve relocations using the definition provided by another code object
1943 or explicitly defined by the runtime.
1945 If the symbol resides in local/group memory (LDS) then its section is the
1946 special processor specific section name ``SHN_AMDGPU_LDS``, and the
``st_value`` field describes alignment requirements as it does for common
symbols.
1952 Add description of linked shared object symbols. Seems undefined symbols
1953 are marked as STT_NOTYPE.
1956 Every HSA kernel has an associated kernel descriptor. It is the address of the
1957 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1958 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1959 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1962 Every HSA kernel also has a symbol for its machine code entry point.
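As an illustrative sketch (not part of the ABI), a tool could locate the kernel
descriptors in a code object by scanning the symbol table for ``STT_OBJECT``
symbols whose names end in ``.kd``; the ``symtab`` and ``strtab`` pointers are
assumed to have been obtained by the usual ELF section header walk.

.. code:: c

  #include <elf.h>
  #include <stdio.h>
  #include <string.h>

  /* Print every kernel descriptor symbol (*link-name*.kd) in a symbol table.
     symtab points at the contents of the symbol table section and strtab at
     its associated string table; nsyms is the number of symbol entries. */
  static void list_kernel_descriptors(const Elf64_Sym *symtab, size_t nsyms,
                                      const char *strtab) {
    for (size_t i = 0; i < nsyms; ++i) {
      const char *name = strtab + symtab[i].st_name;
      size_t len = strlen(name);
      if (ELF64_ST_TYPE(symtab[i].st_info) == STT_OBJECT && len > 3 &&
          strcmp(name + len - 3, ".kd") == 0)
        printf("%s at 0x%llx\n", name,
               (unsigned long long)symtab[i].st_value);
    }
  }
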
1964 .. _amdgpu-relocation-records:
1969 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1970 relocatable fields are:
``word32``
This specifies a 32-bit field occupying 4 bytes with arbitrary byte
alignment. These values use the same byte order as other word values in the
AMDGPU architecture.
``word64``
This specifies a 64-bit field occupying 8 bytes with arbitrary byte
alignment. These values use the same byte order as other word values in the
AMDGPU architecture.
The following notations are used for specifying relocation calculations:
**A**
Represents the addend used to compute the value of the relocatable field.
**G**
Represents the offset into the global offset table at which the relocation
entry's symbol will reside during execution.
**GOT**
Represents the address of the global offset table.
**P**
Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
of the storage unit being relocated (computed using ``r_offset``).
**S**
Represents the value of the symbol whose index resides in the relocation
entry. Relocations not using this must specify a symbol index of
``STN_UNDEF``.
**B**
Represents the base address of a loaded executable or shared object which is
the difference between the ELF address and the actual load address.
Relocations using this are only valid in executable or shared objects.
2008 The following relocation types are supported:
2010 .. table:: AMDGPU ELF Relocation Records
2011 :name: amdgpu-elf-relocation-records-table
2013 ========================== ======= ===== ========== ==============================
2014 Relocation Type Kind Value Field Calculation
2015 ========================== ======= ===== ========== ==============================
2016 ``R_AMDGPU_NONE`` 0 *none* *none*
``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
Dynamic
``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
Dynamic
``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Dynamic
``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
Dynamic
2027 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
2028 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
2029 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
2030 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
2031 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
2033 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
2034 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
2035 ========================== ======= ===== ========== ==============================
2037 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
2038 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
2040 There is no current OS loader support for 32-bit programs and so
2041 ``R_AMDGPU_ABS32`` is not used.
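As an illustration of the calculation column above, the following sketch
computes the two ``word32`` values for an ``R_AMDGPU_REL32_LO`` and
``R_AMDGPU_REL32_HI`` pair from the symbol value S, addend A, and place P:

.. code:: c

  #include <stdint.h>

  /* R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF */
  static uint32_t rel32_lo(uint64_t s, int64_t a, uint64_t p) {
    return (uint32_t)((s + (uint64_t)a - p) & 0xFFFFFFFF);
  }

  /* R_AMDGPU_REL32_HI: (S + A - P) >> 32 */
  static uint32_t rel32_hi(uint64_t s, int64_t a, uint64_t p) {
    return (uint32_t)((s + (uint64_t)a - p) >> 32);
  }
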
2043 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
2045 Loaded Code Object Path Uniform Resource Identifier (URI)
2046 ---------------------------------------------------------
2048 The AMD GPU code object loader represents the path of the ELF shared object from
2049 which the code object was loaded as a textual Uniform Resource Identifier (URI).
2050 Note that the code object is the in memory loaded relocated form of the ELF
2051 shared object. Multiple code objects may be loaded at different memory
2052 addresses in the same process from the same ELF shared object.
2054 The loaded code object path URI syntax is defined by the following BNF syntax:
2058 code_object_uri ::== file_uri | memory_uri
2059 file_uri ::== "file://" file_path [ range_specifier ]
2060 memory_uri ::== "memory://" process_id range_specifier
2061 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
2062 file_path ::== URI_ENCODED_OS_FILE_PATH
2063 process_id ::== DECIMAL_NUMBER
2064 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
``number``
Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
and octal values by "0".
``file_path``
Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
encoded as two uppercase hexadecimal digits preceded by "%". Directories in
the path are separated by "/".
``offset``
Is a 0-based byte offset to the start of the code object. For a file URI, it
is from the start of the file specified by the ``file_path``, and if omitted
defaults to 0. For a memory URI, it is the memory address and is required.
``size``
Is the number of bytes in the code object. For a file URI, if omitted it
defaults to the size of the file. It is required for a memory URI.
``process_id``
Is the identity of the process owning the memory. For Linux it is the C
unsigned integral decimal literal for the process ID (PID).
For example:
2093 file:///dir1/dir2/file1
2094 file:///dir3/dir4/file2#offset=0x2000&size=3000
2095 memory://1234#offset=0x20000&size=3000
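For illustration only, the memory URI form shown in the last example above
could be produced with a format call such as the following; the function and
parameter names are arbitrary:

.. code:: c

  #include <stdint.h>
  #include <stdio.h>

  /* Format a loaded code object memory URI, for example
     "memory://1234#offset=0x20000&size=3000". */
  static int format_memory_uri(char *buf, size_t buf_size, unsigned pid,
                               uint64_t load_address, uint64_t byte_size) {
    return snprintf(buf, buf_size, "memory://%u#offset=0x%llx&size=%llu",
                    pid, (unsigned long long)load_address,
                    (unsigned long long)byte_size);
  }
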
2097 .. _amdgpu-dwarf-debug-information:
2099 DWARF Debug Information
2100 =======================
2104 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
2105 is not currently fully implemented and is subject to change.
2107 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
2108 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
2109 object executable code and data to the source language constructs. It can be
2110 used by tools such as debuggers and profilers. It uses features defined in
2111 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
2112 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
2114 This section defines the AMDGPU target architecture specific DWARF mappings.
2116 .. _amdgpu-dwarf-register-identifier:
2121 This section defines the AMDGPU target architecture register numbers used in
2122 DWARF operation expressions (see DWARF Version 5 section 2.5 and
2123 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
2124 instructions (see DWARF Version 5 section 6.4 and
2125 :ref:`amdgpu-dwarf-call-frame-information`).
2127 A single code object can contain code for kernels that have different wavefront
2128 sizes. The vector registers and some scalar registers are based on the wavefront
2129 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
2130 simplifies the consumer of the DWARF so that each register has a fixed size,
2131 rather than being dynamic according to the wavefront size mode. Similarly,
2132 distinct DWARF registers are defined for those registers that vary in size
2133 according to the process address size. This allows a consumer to treat a
2134 specific AMDGPU processor as a single architecture regardless of how it is
2135 configured at run time. The compiler explicitly specifies the DWARF registers
2136 that match the mode in which the code it is generating will be executed.
2138 DWARF registers are encoded as numbers, which are mapped to architecture
2139 registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.
2143 .. table:: AMDGPU DWARF Register Mapping
2144 :name: amdgpu-dwarf-register-mapping-table
2146 ============== ================= ======== ==================================
2147 DWARF Register AMDGPU Register Bit Size Description
2148 ============== ================= ======== ==================================
2149 0 PC_32 32 Program Counter (PC) when
2150 executing in a 32-bit process
2151 address space. Used in the CFI to
describe the PC of the calling frame.
2154 1 EXEC_MASK_32 32 Execution Mask Register when
2155 executing in wavefront 32 mode.
2156 2-15 *Reserved* *Reserved for highly accessed
2157 registers using DWARF shortcut.*
2158 16 PC_64 64 Program Counter (PC) when
2159 executing in a 64-bit process
2160 address space. Used in the CFI to
describe the PC of the calling frame.
2163 17 EXEC_MASK_64 64 Execution Mask Register when
2164 executing in wavefront 64 mode.
2165 18-31 *Reserved* *Reserved for highly accessed
2166 registers using DWARF shortcut.*
32-95 SGPR0-SGPR63 32 Scalar General Purpose Registers.
2169 96-127 *Reserved* *Reserved for frequently accessed
2170 registers using DWARF 1-byte ULEB.*
2171 128 STATUS 32 Status Register.
2172 129-511 *Reserved* *Reserved for future Scalar
2173 Architectural Registers.*
2174 512 VCC_32 32 Vector Condition Code Register
when executing in wavefront 32 mode.
2177 513-767 *Reserved* *Reserved for future Vector
2178 Architectural Registers when
2179 executing in wavefront 32 mode.*
2180 768 VCC_64 64 Vector Condition Code Register
when executing in wavefront 64 mode.
2183 769-1023 *Reserved* *Reserved for future Vector
2184 Architectural Registers when
2185 executing in wavefront 64 mode.*
2186 1024-1087 *Reserved* *Reserved for padding.*
2187 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers.
2188 1130-1535 *Reserved* *Reserved for future Scalar
2189 General Purpose Registers.*
2190 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers
when executing in wavefront 32 mode.
2193 1792-2047 *Reserved* *Reserved for future Vector
2194 General Purpose Registers when
2195 executing in wavefront 32 mode.*
2196 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers
when executing in wavefront 32 mode.
2199 2304-2559 *Reserved* *Reserved for future Vector
2200 Accumulation Registers when
2201 executing in wavefront 32 mode.*
2202 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers
when executing in wavefront 64 mode.
2205 2816-3071 *Reserved* *Reserved for future Vector
2206 General Purpose Registers when
2207 executing in wavefront 64 mode.*
2208 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers
when executing in wavefront 64 mode.
2211 3328-3583 *Reserved* *Reserved for future Vector
2212 Accumulation Registers when
2213 executing in wavefront 64 mode.*
2214 ============== ================= ======== ==================================
2216 The vector registers are represented as the full size for the wavefront. They
2217 are organized as consecutive dwords (32-bits), one per lane, with the dword at
2218 the least significant bit position corresponding to lane 0 and so forth. DWARF
2219 location expressions involving the ``DW_OP_LLVM_offset`` and
2220 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
2221 register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
2225 If the wavefront size is 32 lanes then the wavefront 32 mode register
2226 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
2227 mode register definitions are used. Some AMDGPU targets support executing in
2228 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
2229 to the wavefront mode of the generated code will be used.
2231 If code is generated to execute in a 32-bit process address space, then the
2232 32-bit process address space register definitions are used. If code is generated
2233 to execute in a 64-bit process address space, then the 64-bit process address
2234 space register definitions are used. The ``amdgcn`` target only supports the
2235 64-bit process address space.
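As a sketch of the mapping in the table above, a producer could compute the
DWARF register numbers for scalar and vector registers as follows; the function
names are illustrative and the register indices are assumed to be in range:

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* DWARF register number for SGPRn: SGPR0-SGPR63 map to 32-95 and
     SGPR64-SGPR105 map to 1088-1129. */
  static uint32_t dwarf_sgpr(uint32_t n) {
    return n < 64 ? 32 + n : 1088 + (n - 64);
  }

  /* DWARF register number for VGPRn: 1536+n in wavefront 32 mode and
     2560+n in wavefront 64 mode. */
  static uint32_t dwarf_vgpr(uint32_t n, bool wave64) {
    return (wave64 ? 2560 : 1536) + n;
  }
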
2237 .. _amdgpu-dwarf-memory-space-identifier:
2239 Memory Space Identifier
2240 -----------------------
2242 The DWARF memory space represents the source language memory space. See DWARF
2243 Version 5 section 2.12 which is updated by the *DWARF Extensions For
2244 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
2246 The DWARF memory space mapping used for AMDGPU is defined in
2247 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
2249 .. table:: AMDGPU DWARF Memory Space Mapping
2250 :name: amdgpu-dwarf-memory-space-mapping-table
2252 =========================== ====== =================
2254 ---------------------------------- -----------------
2255 Memory Space Name Value Memory Space
2256 =========================== ====== =================
2257 ``DW_MSPACE_LLVM_none`` 0x0000 Generic (Flat)
2258 ``DW_MSPACE_LLVM_global`` 0x0001 Global
2259 ``DW_MSPACE_LLVM_constant`` 0x0002 Global
2260 ``DW_MSPACE_LLVM_group`` 0x0003 Local (group/LDS)
2261 ``DW_MSPACE_LLVM_private`` 0x0004 Private (Scratch)
2262 ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
2263 =========================== ====== =================
2265 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
2266 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. This is
2269 available for use for the AMD extension for access to the hardware GDS memory
2270 which is scratchpad memory allocated per device.
2272 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
2273 default memory space of ``DW_MSPACE_LLVM_none`` is used.
2275 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF memory spaces to DWARF address spaces, including address size
and NULL value.
2279 .. _amdgpu-dwarf-address-space-identifier:
2281 Address Space Identifier
2282 ------------------------
2284 DWARF address spaces correspond to target architecture specific linear
2285 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
2286 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
2288 The DWARF address space mapping used for AMDGPU is defined in
2289 :ref:`amdgpu-dwarf-address-space-mapping-table`.
2291 .. table:: AMDGPU DWARF Address Space Mapping
2292 :name: amdgpu-dwarf-address-space-mapping-table
2294 ======================================= ===== ======= ======== ===================== =======================
2296 --------------------------------------- ----- ---------------- --------------------- -----------------------
2297 Address Space Name Value Address Bit Size LLVM IR Address Space
2298 --------------------------------------- ----- ------- -------- --------------------- -----------------------
2303 ======================================= ===== ======= ======== ===================== =======================
2304 ``DW_ASPACE_LLVM_none`` 0x00 64 32 Global *default address space*
2305 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
2306 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
2307 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
2309 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
2310 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
2311 ======================================= ===== ======= ======== ===================== =======================
2313 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
2314 spaces including address size and NULL value.
2316 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
2317 address space used in DWARF operations that do not specify an address space. It
2318 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
2319 related operations can refer to addresses in the program code.
2321 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
2322 specify the flat address space. If the address corresponds to an address in the
2323 local address space, then it corresponds to the wavefront that is executing the
2324 focused thread of execution. If the address corresponds to an address in the
2325 private address space, then it corresponds to the lane that is executing the
2326 focused thread of execution for languages that are implemented using a SIMD or
2327 SIMT execution model.
2331 CUDA-like languages such as HIP that do not have address spaces in the
2332 language type system, but do allow variables to be allocated in different
2333 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2334 address space in the DWARF expression operations as the default address space
2335 is the global address space.
2337 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2338 specify the local address space corresponding to the wavefront that is executing
2339 the focused thread of execution.
2341 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2342 to specify the private address space corresponding to the lane that is executing
2343 the focused thread of execution for languages that are implemented using a SIMD
2344 or SIMT execution model.
2346 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2347 to specify the unswizzled private address space corresponding to the wavefront
2348 that is executing the focused thread of execution. The wavefront view of private
2349 memory is the per wavefront unswizzled backing memory layout defined in
2350 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2351 location for the backing memory of the wavefront (namely the address is not
2352 offset by ``wavefront-scratch-base``). The following formula can be used to
2353 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2354 ``DW_ASPACE_AMDGPU_private_wave`` address:
2358 private-address-wavefront =
2359 ((private-address-lane / 4) * wavefront-size * 4) +
2360 (wavefront-lane-id * 4) + (private-address-lane % 4)
2362 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this can be
simplified to:
2368 private-address-wavefront =
2369 private-address-lane * wavefront-size
2371 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2372 complete spilled vector register back into a complete vector register in the
2373 CFI. The frame pointer can be a private lane address which is dword aligned,
2374 which can be shifted to multiply by the wavefront size, and then used to form a
2375 private wavefront address that gives a location for a contiguous set of dwords,
2376 one per lane, where the vector register dwords are spilled. The compiler knows
2377 the wavefront size since it generates the code. Note that the type of the
2378 address may have to be converted as the size of a
2379 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2380 ``DW_ASPACE_AMDGPU_private_wave`` address.
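The conversion formulas above can be expressed directly in code; a minimal
sketch with illustrative names:

.. code:: c

  #include <stdint.h>

  /* Convert a DW_ASPACE_AMDGPU_private_lane address to the corresponding
     DW_ASPACE_AMDGPU_private_wave address for the given lane, following the
     first formula above. wavefront_size is 32 or 64. */
  static uint64_t private_lane_to_wave(uint32_t private_address_lane,
                                       uint32_t wavefront_lane_id,
                                       uint32_t wavefront_size) {
    return ((uint64_t)(private_address_lane / 4) * wavefront_size * 4) +
           ((uint64_t)wavefront_lane_id * 4) + (private_address_lane % 4);
  }
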
2382 .. _amdgpu-dwarf-lane-identifier:
DWARF lane identifiers specify a target architecture lane position for hardware
2388 that executes in a SIMD or SIMT manner, and on which a source language maps its
2389 threads of execution onto those lanes. The DWARF lane identifier is pushed by
2390 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2391 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2392 section :ref:`amdgpu-dwarf-operation-expressions`.
2394 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2395 wavefront. It is numbered from 0 to the wavefront size minus 1.
2397 Operation Expressions
2398 ---------------------
2400 DWARF expressions are used to compute program values and the locations of
2401 program objects. See DWARF Version 5 section 2.5 and
2402 :ref:`amdgpu-dwarf-operation-expressions`.
2404 DWARF location descriptions describe how to access storage which includes memory
2405 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2406 significant bytes first, and bits are ordered within bytes with least
2407 significant bits first.
2409 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2410 unwinding vector registers that are spilled under the execution mask to memory:
2411 the zero-single location description is the vector register, and the one-single
2412 location description is the spilled memory location description. The
2413 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2414 memory location description.
2416 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2417 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2418 controlled by the execution mask. An undefined location description together
2419 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2420 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2422 Debugger Information Entry Attributes
2423 -------------------------------------
2425 This section describes how certain debugger information entry attributes are
2426 used by AMDGPU. See the sections in DWARF Version 5 section 3.3.5 and 3.1.1
2427 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2428 :ref:`amdgpu-dwarf-low-level-information` and
2429 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2431 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2433 ``DW_AT_LLVM_lane_pc``
2434 ~~~~~~~~~~~~~~~~~~~~~~
2436 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2437 location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.
2442 If the lane is inactive, but was active on entry to the subprogram, then this is
2443 the program location in the subprogram at which execution of the lane is
conceptually positioned.
2446 If the lane was not active on entry to the subprogram, then this will be the
2447 undefined location. A client debugger can check if the lane is part of a valid
2448 work-group by checking that the lane is in the range of the associated
2449 work-group within the grid, accounting for partial work-groups. If it is not,
2450 then the debugger can omit any information for the lane. Otherwise, the debugger
2451 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2452 calling subprogram until it finds a non-undefined location. Conceptually the
lane only has the call frames for which it has a non-undefined
2454 ``DW_AT_LLVM_lane_pc``.
2456 The following example illustrates how the AMDGPU backend can generate a DWARF
2457 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2458 following subprogram pseudo code for a target with 64 lanes per wavefront.
2480 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2481 execution mask (``EXEC``) to linearize the control flow. The condition is
2482 evaluated to make a mask of the lanes for which the condition evaluates to true.
2483 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2484 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
2485 ``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of
2486 the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE``
2487 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2488 region. This is shown below. Other approaches are possible, but the basic
2489 concept is the same.
2522 To create the DWARF location list expression that defines the location
2523 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2524 pseudo instruction can be used to annotate the linearized control flow. This can
2525 be done by defining an artificial variable for the lane PC. The DWARF location
2526 list expression created for it is used as the value of the
2527 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2529 A DWARF procedure is defined for each well nested structured control flow region
2530 which provides the conceptual lane program location for a lane if it is not
2531 active (namely it is divergent). The DWARF operation expression for each region
2532 conceptually inherits the value of the immediately enclosing region and modifies
2533 it according to the semantics of the region.
2535 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2536 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2537 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2538 region since the ``THEN`` region has completed.
2540 The lane PC artificial variable is assigned at each region transition. It uses
2541 the immediately enclosing region's DWARF procedure to compute the program
2542 location for each lane assuming they are divergent, and then modifies the result
2543 by inserting the current program location for each lane that the ``EXEC`` mask
2544 indicates is active.
2546 By having separate DWARF procedures for each region, they can be reused to
2547 define the value for any nested region. This reduces the total size of the DWARF
2548 operation expressions.
2550 The following provides an example using pseudo LLVM MIR.
2556 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2557 DW_AT_name = "__uint64";
2558 DW_AT_byte_size = 8;
2559 DW_AT_encoding = DW_ATE_unsigned;
2561 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2562 DW_AT_name = "__active_lane_pc";
2565 DW_OP_LLVM_extend 64, 64;
2566 DW_OP_regval_type EXEC, %uint_64;
2567 DW_OP_LLVM_select_bit_piece 64, 64;
2570 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2571 DW_AT_name = "__divergent_lane_pc";
2573 DW_OP_LLVM_undefined;
2574 DW_OP_LLVM_extend 64, 64;
2577 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2578 DW_OP_call_ref %__divergent_lane_pc;
2579 DW_OP_call_ref %__active_lane_pc;
2583 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2588 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2589 DW_AT_name = "__divergent_lane_pc_1_then";
2590 DW_AT_location = DIExpression[
2591 DW_OP_call_ref %__divergent_lane_pc;
2592 DW_OP_addrx &lex_1_start;
2594 DW_OP_LLVM_extend 64, 64;
2595 DW_OP_call_ref %__lex_1_save_exec;
2596 DW_OP_deref_type 64, %__uint_64;
2597 DW_OP_LLVM_select_bit_piece 64, 64;
2600 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2601 DW_OP_call_ref %__divergent_lane_pc_1_then;
2602 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2611 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2612 DW_AT_name = "__divergent_lane_pc_1_1_then";
2613 DW_AT_location = DIExpression[
2614 DW_OP_call_ref %__divergent_lane_pc_1_then;
2615 DW_OP_addrx &lex_1_1_start;
2617 DW_OP_LLVM_extend 64, 64;
2618 DW_OP_call_ref %__lex_1_1_save_exec;
2619 DW_OP_deref_type 64, %__uint_64;
2620 DW_OP_LLVM_select_bit_piece 64, 64;
2623 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2624 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2625 DW_OP_call_ref %__active_lane_pc;
2630 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2631 DW_AT_name = "__divergent_lane_pc_1_1_else";
2632 DW_AT_location = DIExpression[
2633 DW_OP_call_ref %__divergent_lane_pc_1_then;
2634 DW_OP_addrx &lex_1_1_end;
2636 DW_OP_LLVM_extend 64, 64;
2637 DW_OP_call_ref %__lex_1_1_save_exec;
2638 DW_OP_deref_type 64, %__uint_64;
2639 DW_OP_LLVM_select_bit_piece 64, 64;
2642 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2643 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2644 DW_OP_call_ref %__active_lane_pc;
2649 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2650 DW_OP_call_ref %__divergent_lane_pc;
2651 DW_OP_call_ref %__active_lane_pc;
2656 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2657 DW_AT_name = "__divergent_lane_pc_1_else";
2658 DW_AT_location = DIExpression[
2659 DW_OP_call_ref %__divergent_lane_pc;
2660 DW_OP_addrx &lex_1_end;
2662 DW_OP_LLVM_extend 64, 64;
2663 DW_OP_call_ref %__lex_1_save_exec;
2664 DW_OP_deref_type 64, %__uint_64;
2665 DW_OP_LLVM_select_bit_piece 64, 64;
2668 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2669 DW_OP_call_ref %__divergent_lane_pc_1_else;
2670 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2676 DW_OP_call_ref %__divergent_lane_pc;
2677 DW_OP_call_ref %__active_lane_pc;
2682 The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
2683 that are active, with the current program location.
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are created for
2686 the execution masks saved on entry to a region. Using the ``DBG_VALUE`` pseudo
2687 instruction, location list entries will be created that describe where the
2688 artificial variables are allocated at any given program location. The compiler
2689 may allocate them to registers or spill them to memory.
2691 The DWARF procedures for each region use the values of the saved execution mask
2692 artificial variables to only update the lanes that are active on entry to the
2693 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will have
2695 the undefined location description.
2697 Other structured control flow regions can be handled similarly. For example,
2698 loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
2702 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2703 ``IF/THEN/ELSE`` regions.
2705 The DWARF procedures can use the active lane artificial variable described in
2706 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2707 ``EXEC`` mask in order to support whole or quad wavefront mode.
2709 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2711 ``DW_AT_LLVM_active_lane``
2712 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2714 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.
2718 The execution mask may be modified to implement whole or quad wavefront mode
2719 operations. For example, all lanes may need to temporarily be made active to
2720 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2721 update it to enable the necessary lanes, perform the operations, and then
2722 restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
actual ``EXEC`` mask.
2726 This is handled by defining an artificial variable for the active lane mask. The
2727 active lane mask artificial variable would be the actual ``EXEC`` mask for
2728 normal regions, and the saved execution mask for regions where the mask is
2729 temporarily updated. The location list expression created for this artificial
2730 variable is used to define the value of the ``DW_AT_LLVM_active_lane``
2733 ``DW_AT_LLVM_augmentation``
2734 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2736 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2737 debugger information entry has the following value for the augmentation string:
2743 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2744 extensions used in the DWARF of the compilation unit. The version number
2745 conforms to [SEMVER]_.
2747 Call Frame Information
2748 ----------------------
2750 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2751 *unwind* call frames in a running process or core dump. See DWARF Version 5
2752 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2754 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2756 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
2762 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
2763 extensions used in this CIE or to the FDEs that use it. The version number
2764 conforms to [SEMVER]_.
2766 2. ``address_size`` for the ``Global`` address space is defined in
2767 :ref:`amdgpu-dwarf-address-space-identifier`.
2769 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2771 4. ``code_alignment_factor`` is 4 bytes.
2775 Add to :ref:`amdgpu-processor-table` table.
2777 5. ``data_alignment_factor`` is 4 bytes.
2781 Add to :ref:`amdgpu-processor-table` table.
2783 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2784 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2786 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
2787 called from subprogram Y that has more allocated, X will not change any of
2788 the extra registers as it cannot access them. Therefore, the default rule
2789 for all columns is ``same value``.
2791 For AMDGPU the register number follows the numbering defined in
2792 :ref:`amdgpu-dwarf-register-identifier`.
2794 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2795 the return address to get the address of a byte within the call site
2796 instructions. See DWARF Version 5 section 6.4.4.
2801 See DWARF Version 5 section 6.1.
2803 Lookup By Name Section Header
2804 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2806 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2808 For AMDGPU the lookup by name section header table:
2810 ``augmentation_string_size`` (uword)
Set to the length of the ``augmentation_string`` value which is always a
multiple of 4 bytes.
2815 ``augmentation_string`` (sequence of UTF-8 characters)
2817 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2823 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of this index. The version number conforms to
[SEMVER]_.
2829 This is different to the DWARF Version 5 definition that requires the first
2830 4 characters to be the vendor ID. But this is consistent with the other
2831 augmentation strings and does allow multiple vendor contributions. However,
2832 backwards compatibility may be more desirable.
2834 Lookup By Address Section Header
2835 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2837 See DWARF Version 5 section 6.1.2.
2839 For AMDGPU the lookup by address section header table:
2841 ``address_size`` (ubyte)
Matches the address size for the ``Global`` address space defined in
2844 :ref:`amdgpu-dwarf-address-space-identifier`.
2846 ``segment_selector_size`` (ubyte)
2848 AMDGPU does not use a segment selector so this is 0. The entries in the
2849 ``.debug_aranges`` do not have a segment selector.
2851 Line Number Information
2852 -----------------------
2854 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2857 The instruction set must be obtained from the ELF file header ``e_flags`` field
2858 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2859 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
2863 Should the ``isa`` state machine register be used to indicate if the code is
2864 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2866 For AMDGPU the line number program header fields have the following values (see
2867 DWARF Version 5 section 6.2.4):
2869 ``address_size`` (ubyte)
2870 Matches the address size for the ``Global`` address space defined in
2871 :ref:`amdgpu-dwarf-address-space-identifier`.
2873 ``segment_selector_size`` (ubyte)
2874 AMDGPU does not use a segment selector so this is 0.
2876 ``minimum_instruction_length`` (ubyte)
2877 For GFX9-GFX11 this is 4.
2879 ``maximum_operations_per_instruction`` (ubyte)
2880 For GFX9-GFX11 this is 1.
2882 Source text for online-compiled programs (for example, those compiled by the
2883 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2884 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2885 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2886 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2888 The Clang option used to control source embedding in AMDGPU is defined in
2889 :ref:`amdgpu-clang-debug-options-table`.
2891 .. table:: AMDGPU Clang Debug Options
2892 :name: amdgpu-clang-debug-options-table
2894 ==================== ==================================================
2895 Debug Flag Description
2896 ==================== ==================================================
2897 -g[no-]embed-source Enable/disable embedding source text in DWARF
2898 debug sections. Useful for environments where
2899 source cannot be written to disk, such as
2900 when performing online compilation.
2901 ==================== ==================================================
``-gembed-source``
Enable the embedded source.
2908 ``-gno-embed-source``
2909 Disable the embedded source.
2911 32-Bit and 64-Bit DWARF Formats
2912 -------------------------------
2914 See DWARF Version 5 section 7.4 and
2915 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.
2922 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2923 the 32-bit DWARF format.
2928 For AMDGPU the following values apply for each of the unit headers described in
2929 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2931 ``address_size`` (ubyte)
2932 Matches the address size for the ``Global`` address space defined in
2933 :ref:`amdgpu-dwarf-address-space-identifier`.
2935 .. _amdgpu-code-conventions:
2940 This section provides code conventions used for each supported target triple OS
2941 (see :ref:`amdgpu-target-triples`).
2946 This section provides code conventions used when the target triple OS is
2947 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2949 .. _amdgpu-amdhsa-code-object-metadata:
2951 Code Object Metadata
2952 ~~~~~~~~~~~~~~~~~~~~
2954 The code object metadata specifies extensible metadata associated with the code
2955 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
2956 encoding and semantics of this metadata depends on the code object version; see
2957 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2958 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2959 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2960 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2962 Code object metadata is specified in a note record (see
2963 :ref:`amdgpu-note-records`) and is required when the target triple OS is
2964 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2965 information necessary to support the HSA compatible runtime kernel queries. For
2966 example, the segment sizes needed in a dispatch packet. In addition, a
2967 high-level language runtime may require other information to be included. For
2968 example, the AMD OpenCL runtime records kernel argument information.
2970 .. _amdgpu-amdhsa-code-object-metadata-v2:
2972 Code Object V2 Metadata
2973 +++++++++++++++++++++++
2976 Code object V2 generation is no longer supported by this version of LLVM.
2978 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2979 (see :ref:`amdgpu-note-records-v2`).
2981 The metadata is specified as a YAML formatted string (see [YAML]_ and
2986 Is the string null terminated? It probably should not if YAML allows it to
2987 contain null characters, otherwise it should be.
2989 The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.
2993 For boolean values, the string values of ``false`` and ``true`` are used for
2994 false and true respectively.
2996 Additional information can be added to the mappings. To avoid conflicts, any
2997 non-AMD key names should be prefixed by "*vendor-name*.".
2999 .. table:: AMDHSA Code Object V2 Metadata Map
3000 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
3002 ========== ============== ========= =======================================
3003 String Key Value Type Required? Description
3004 ========== ============== ========= =======================================
3005 "Version" sequence of Required - The first integer is the major
3006 2 integers version. Currently 1.
3007 - The second integer is the minor
3008 version. Currently 0.
3009 "Printf" sequence of Each string is encoded information
3010 strings about a printf function call. The
3011 encoded information is organized as
3012 fields separated by colon (':'):
3014 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3019 A 32-bit integer as a unique id for
3020 each printf function call
3023 A 32-bit integer equal to the number
3024 of arguments of printf function call
3027 ``S[i]`` (where i = 0, 1, ... , N-1)
3028 32-bit integers for the size in bytes
3029 of the i-th FormatString argument of
3030 the printf function call
3033 The format string passed to the
3034 printf function call.
3035 "Kernels" sequence of Required Sequence of the mappings for each
3036 mapping kernel in the code object. See
3037 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
3038 for the definition of the mapping.
3039 ========== ============== ========= =======================================
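As an illustration of the colon-separated printf encoding described above, the
following is a minimal sketch (not part of the metadata specification) that
splits one such entry into its fields; the helper name and sample string are
hypothetical.

.. code:: python

   def parse_printf_metadata(entry: str):
       """Split an ``ID:N:S[0]:...:S[N-1]:FormatString`` printf entry."""
       parts = entry.split(":")
       printf_id = int(parts[0])
       num_args = int(parts[1])
       arg_sizes = [int(s) for s in parts[2:2 + num_args]]
       # The format string itself may contain ':' characters, so re-join
       # everything that follows the argument sizes.
       format_string = ":".join(parts[2 + num_args:])
       return printf_id, num_args, arg_sizes, format_string

   # Hypothetical entry describing a two-argument printf call:
   print(parse_printf_metadata("1:2:4:8:%d %s"))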
3043 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
3044 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
3046 ================= ============== ========= ================================
3047 String Key Value Type Required? Description
3048 ================= ============== ========= ================================
3049 "Name" string Required Source name of the kernel.
3050 "SymbolName" string Required Name of the kernel
3051 descriptor ELF symbol.
3052 "Language" string Source language of the kernel.
3060 "LanguageVersion" sequence of - The first integer is the major
3062 - The second integer is the
3064 "Attrs" mapping Mapping of kernel attributes.
3066 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
3067 for the mapping definition.
3068 "Args" sequence of Sequence of mappings of the
3069 mapping kernel arguments. See
3070 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
3071 for the definition of the mapping.
3072 "CodeProps" mapping Mapping of properties related to
3073 the kernel code. See
3074 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
3075 for the mapping definition.
3076 ================= ============== ========= ================================
3080 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
3081 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
3083 =================== ============== ========= ==============================
3084 String Key Value Type Required? Description
3085 =================== ============== ========= ==============================
3086 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
3087 3 integers must be >=1 and the dispatch
3088 work-group size X, Y, Z must
3089 correspond to the specified
3090 values. Defaults to 0, 0, 0.
3092 Corresponds to the OpenCL
3093 ``reqd_work_group_size``
3095 "WorkGroupSizeHint" sequence of The dispatch work-group size
3096 3 integers X, Y, Z is likely to be the
3099 Corresponds to the OpenCL
3100 ``work_group_size_hint``
3102 "VecTypeHint" string The name of a scalar or vector
3105 Corresponds to the OpenCL
3106 ``vec_type_hint`` attribute.
3108 "RuntimeHandle" string The external symbol name
3109 associated with a kernel.
3110 OpenCL runtime allocates a
3111 global buffer for the symbol
3112 and saves the kernel's address
3113 to it, which is used for
3114 device side enqueueing. Only
3115 available for device side
3117 =================== ============== ========= ==============================
3121 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
3122 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
3124 ================= ============== ========= ================================
3125 String Key Value Type Required? Description
3126 ================= ============== ========= ================================
3127 "Name" string Kernel argument name.
3128 "TypeName" string Kernel argument type name.
3129 "Size" integer Required Kernel argument size in bytes.
3130 "Align" integer Required Kernel argument alignment in
3131 bytes. Must be a power of two.
3132 "ValueKind" string Required Kernel argument kind that
3133 specifies how to set up the
3134 corresponding argument.
3138 The argument is copied
3139 directly into the kernarg.
3142 A global address space pointer
3143 to the buffer data is passed
3146 "DynamicSharedPointer"
3147 A group address space pointer
3148 to dynamically allocated LDS
3149 is passed in the kernarg.
3152 A global address space
3153 pointer to a S# is passed in
3157 A global address space
3158 pointer to a T# is passed in
3162 A global address space pointer
3163 to an OpenCL pipe is passed in
3167 A global address space pointer
3168 to an OpenCL device enqueue
3169 queue is passed in the
3172 "HiddenGlobalOffsetX"
3173 The OpenCL grid dispatch
3174 global offset for the X
3175 dimension is passed in the
3178 "HiddenGlobalOffsetY"
3179 The OpenCL grid dispatch
3180 global offset for the Y
3181 dimension is passed in the
3184 "HiddenGlobalOffsetZ"
3185 The OpenCL grid dispatch
3186 global offset for the Z
3187 dimension is passed in the
3191 An argument that is not used
3192 by the kernel. Space needs to
3193 be left for it, but it does
3194 not need to be set up.
3196 "HiddenPrintfBuffer"
3197 A global address space pointer
3198 to the runtime printf buffer
3199 is passed in kernarg. Mutually
3201 "HiddenHostcallBuffer".
3203 "HiddenHostcallBuffer"
3204 A global address space pointer
3205 to the runtime hostcall buffer
3206 is passed in kernarg. Mutually
3208 "HiddenPrintfBuffer".
3210 "HiddenDefaultQueue"
3211 A global address space pointer
3212 to the OpenCL device enqueue
3213 queue that should be used by
3214 the kernel by default is
3215 passed in the kernarg.
3217 "HiddenCompletionAction"
3218 A global address space pointer
3219 to help link enqueued kernels into
3220 the ancestor tree for determining
3221 when the parent kernel has finished.
3223 "HiddenMultiGridSyncArg"
3224 A global address space pointer for
3225 multi-grid synchronization is
3226 passed in the kernarg.
3228 "ValueType" string Unused and deprecated. This should no longer
3229 be emitted, but is accepted for compatibility.
3232 "PointeeAlign" integer Alignment in bytes of pointee
3233 type for pointer type kernel
3234 argument. Must be a power
3235 of 2. Only present if
3237 "DynamicSharedPointer".
3238 "AddrSpaceQual" string Kernel argument address space
3239 qualifier. Only present if
3240 "ValueKind" is "GlobalBuffer" or
3241 "DynamicSharedPointer". Values
3253 Is GlobalBuffer only Global
3255 DynamicSharedPointer always
3256 Local? Can HCC allow Generic?
3257 How can Private or Region
3260 "AccQual" string Kernel argument access
3261 qualifier. Only present if
3262 "ValueKind" is "Image" or
3275 "ActualAccQual" string The actual memory accesses
3276 performed by the kernel on the
3277 kernel argument. Only present if
3278 "ValueKind" is "GlobalBuffer",
3279 "Image", or "Pipe". This may be
3280 more restrictive than indicated
3281 by "AccQual" to reflect what the
kernel actually does. If not
3283 present then the runtime must
3284 assume what is implied by
3285 "AccQual" and "IsConst". Values
3292 "IsConst" boolean Indicates if the kernel argument
3293 is const qualified. Only present
3297 "IsRestrict" boolean Indicates if the kernel argument
3298 is restrict qualified. Only
3299 present if "ValueKind" is
3302 "IsVolatile" boolean Indicates if the kernel argument
3303 is volatile qualified. Only
3304 present if "ValueKind" is
3307 "IsPipe" boolean Indicates if the kernel argument
3308 is pipe qualified. Only present
3309 if "ValueKind" is "Pipe".
3313 Can GlobalBuffer be pipe
3316 ================= ============== ========= ================================
3320 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
3321 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
3323 ============================ ============== ========= =====================
3324 String Key Value Type Required? Description
3325 ============================ ============== ========= =====================
3326 "KernargSegmentSize" integer Required The size in bytes of
3328 that holds the values
3331 "GroupSegmentFixedSize" integer Required The amount of group
3335 bytes. This does not
3337 dynamically allocated
3338 group segment memory
3342 "PrivateSegmentFixedSize" integer Required The amount of fixed
3343 private address space
3344 memory required for a
3346 bytes. If the kernel
3348 stack then additional
3350 to this value for the
3352 "KernargSegmentAlign" integer Required The maximum byte
3355 kernarg segment. Must
3357 "WavefrontSize" integer Required Wavefront size. Must
3359 "NumSGPRs" integer Required Number of scalar
3363 includes the special
3365 Scratch (GFX7-GFX10)
3367 GFX8-GFX10). It does
3369 SGPR added if a trap
3375 "NumVGPRs" integer Required Number of vector
3379 "MaxFlatWorkGroupSize" integer Required Maximum flat
3382 kernel in work-items.
3385 ReqdWorkGroupSize if
3387 "NumSpilledSGPRs" integer Number of stores from
3388 a scalar register to
3389 a register allocator
3392 "NumSpilledVGPRs" integer Number of stores from
3393 a vector register to
3394 a register allocator
3397 ============================ ============== ========= =====================
3399 .. _amdgpu-amdhsa-code-object-metadata-v3:
3401 Code Object V3 Metadata
3402 +++++++++++++++++++++++
Code object V3 is not the default code object version emitted by this version
of LLVM.
3408 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3409 record (see :ref:`amdgpu-note-records-v3-onwards`).
3411 The metadata is represented as Message Pack formatted binary data (see
3412 [MsgPack]_). The top level is a Message Pack map that includes the
3413 keys defined in table
3414 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3417 Additional information can be added to the maps. To avoid conflicts,
3418 any key names should be prefixed by "*vendor-name*." where
3419 ``vendor-name`` can be the name of the vendor and specific vendor
tool that generates the information. The prefix is abbreviated to
simply "." when it appears within a map that has been added by the same
*vendor-name*.
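For example, once the ``NT_AMDGPU_METADATA`` note payload has been extracted
from the ELF file (for example with ``llvm-readelf --notes``), the map can be
decoded with any MessagePack library. The following hedged sketch uses the
third-party ``msgpack`` Python package on a hypothetical ``payload`` byte
string; it is an illustration only, not a required tool flow.

.. code:: python

   import msgpack  # third-party package: pip install msgpack

   def decode_amdgpu_metadata(payload: bytes) -> dict:
       # 'payload' is assumed to hold the raw descriptor bytes of the
       # NT_AMDGPU_METADATA note record, obtained by other means.
       metadata = msgpack.unpackb(payload, raw=False)
       for kernel in metadata.get("amdhsa.kernels", []):
           print(kernel[".name"], kernel[".kernarg_segment_size"])
       return metadata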
3424 .. table:: AMDHSA Code Object V3 Metadata Map
3425 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3427 ================= ============== ========= =======================================
3428 String Key Value Type Required? Description
3429 ================= ============== ========= =======================================
3430 "amdhsa.version" sequence of Required - The first integer is the major
3431 2 integers version. Currently 1.
3432 - The second integer is the minor
3433 version. Currently 0.
3434 "amdhsa.printf" sequence of Each string is encoded information
3435 strings about a printf function call. The
3436 encoded information is organized as
3437 fields separated by colon (':'):
3439 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3444 A 32-bit integer as a unique id for
3445 each printf function call
3448 A 32-bit integer equal to the number
3449 of arguments of printf function call
3452 ``S[i]`` (where i = 0, 1, ... , N-1)
3453 32-bit integers for the size in bytes
3454 of the i-th FormatString argument of
3455 the printf function call
3458 The format string passed to the
3459 printf function call.
3460 "amdhsa.kernels" sequence of Required Sequence of the maps for each
3461 map kernel in the code object. See
3462 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3463 for the definition of the keys included
3465 ================= ============== ========= =======================================
3469 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3470 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3472 =================================== ============== ========= ================================
3473 String Key Value Type Required? Description
3474 =================================== ============== ========= ================================
3475 ".name" string Required Source name of the kernel.
3476 ".symbol" string Required Name of the kernel
3477 descriptor ELF symbol.
3478 ".language" string Source language of the kernel.
3488 ".language_version" sequence of - The first integer is the major
3490 - The second integer is the
3492 ".args" sequence of Sequence of maps of the
3493 map kernel arguments. See
3494 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3495 for the definition of the keys
3496 included in that map.
3497 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3498 3 integers must be >=1 and the dispatch
3499 work-group size X, Y, Z must
3500 correspond to the specified
3501 values. Defaults to 0, 0, 0.
3503 Corresponds to the OpenCL
3504 ``reqd_work_group_size``
3506 ".workgroup_size_hint" sequence of The dispatch work-group size
3507 3 integers X, Y, Z is likely to be the
3510 Corresponds to the OpenCL
3511 ``work_group_size_hint``
3513 ".vec_type_hint" string The name of a scalar or vector
3516 Corresponds to the OpenCL
3517 ``vec_type_hint`` attribute.
3519 ".device_enqueue_symbol" string The external symbol name
3520 associated with a kernel.
3521 OpenCL runtime allocates a
3522 global buffer for the symbol
3523 and saves the kernel's address
3524 to it, which is used for
3525 device side enqueueing. Only
3526 available for device side
3528 ".kernarg_segment_size" integer Required The size in bytes of
3530 that holds the values
3533 ".group_segment_fixed_size" integer Required The amount of group
3537 bytes. This does not
3539 dynamically allocated
3540 group segment memory
3544 ".private_segment_fixed_size" integer Required The amount of fixed
3545 private address space
3546 memory required for a
3548 bytes. If the kernel
3550 stack then additional
3552 to this value for the
3554 ".kernarg_segment_align" integer Required The maximum byte
3557 kernarg segment. Must
3559 ".wavefront_size" integer Required Wavefront size. Must
3561 ".sgpr_count" integer Required Number of scalar
3562 registers required by a
3564 GFX6-GFX9. A register
3565 is required if it is
3567 if a higher numbered
3570 includes the special
3576 SGPR added if a trap
3582 ".vgpr_count" integer Required Number of vector
3583 registers required by
3585 GFX6-GFX9. A register
3586 is required if it is
3588 if a higher numbered
3591 ".agpr_count" integer Required Number of accumulator
3592 registers required by
3595 ".max_flat_workgroup_size" integer Required Maximum flat
3598 kernel in work-items.
3601 ReqdWorkGroupSize if
3603 ".sgpr_spill_count" integer Number of stores from
3604 a scalar register to
3605 a register allocator
3608 ".vgpr_spill_count" integer Number of stores from
3609 a vector register to
3610 a register allocator
3613 ".kind" string The kind of the kernel
3621 These kernels must be
3622 invoked after loading
3632 These kernels must be
3635 containing code object
3636 and after all init and
3637 normal kernels in the
3638 same code object have
3642 If omitted, "normal" is
3644 =================================== ============== ========= ================================
3648 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3649 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3651 ====================== ============== ========= ================================
3652 String Key Value Type Required? Description
3653 ====================== ============== ========= ================================
3654 ".name" string Kernel argument name.
3655 ".type_name" string Kernel argument type name.
3656 ".size" integer Required Kernel argument size in bytes.
3657 ".offset" integer Required Kernel argument offset in
3658 bytes. The offset must be a
3659 multiple of the alignment
3660 required by the argument.
3661 ".value_kind" string Required Kernel argument kind that
3662 specifies how to set up the
3663 corresponding argument.
3667 The argument is copied
3668 directly into the kernarg.
3671 A global address space pointer
3672 to the buffer data is passed
3675 "dynamic_shared_pointer"
3676 A group address space pointer
3677 to dynamically allocated LDS
3678 is passed in the kernarg.
3681 A global address space
3682 pointer to a S# is passed in
3686 A global address space
3687 pointer to a T# is passed in
3691 A global address space pointer
3692 to an OpenCL pipe is passed in
3696 A global address space pointer
3697 to an OpenCL device enqueue
3698 queue is passed in the
3701 "hidden_global_offset_x"
3702 The OpenCL grid dispatch
3703 global offset for the X
3704 dimension is passed in the
3707 "hidden_global_offset_y"
3708 The OpenCL grid dispatch
3709 global offset for the Y
3710 dimension is passed in the
3713 "hidden_global_offset_z"
3714 The OpenCL grid dispatch
3715 global offset for the Z
3716 dimension is passed in the
3720 An argument that is not used
3721 by the kernel. Space needs to
3722 be left for it, but it does
3723 not need to be set up.
3725 "hidden_printf_buffer"
3726 A global address space pointer
3727 to the runtime printf buffer
3728 is passed in kernarg. Mutually
3730 "hidden_hostcall_buffer"
3731 before Code Object V5.
3733 "hidden_hostcall_buffer"
3734 A global address space pointer
3735 to the runtime hostcall buffer
3736 is passed in kernarg. Mutually
3738 "hidden_printf_buffer"
3739 before Code Object V5.
3741 "hidden_default_queue"
3742 A global address space pointer
3743 to the OpenCL device enqueue
3744 queue that should be used by
3745 the kernel by default is
3746 passed in the kernarg.
3748 "hidden_completion_action"
3749 A global address space pointer
3750 to help link enqueued kernels into
3751 the ancestor tree for determining
3752 when the parent kernel has finished.
3754 "hidden_multigrid_sync_arg"
3755 A global address space pointer for
3756 multi-grid synchronization is
3757 passed in the kernarg.
3759 ".value_type" string Unused and deprecated. This should no longer
3760 be emitted, but is accepted for compatibility.
3762 ".pointee_align" integer Alignment in bytes of pointee
3763 type for pointer type kernel
3764 argument. Must be a power
3765 of 2. Only present if
3767 "dynamic_shared_pointer".
3768 ".address_space" string Kernel argument address space
3769 qualifier. Only present if
3770 ".value_kind" is "global_buffer" or
3771 "dynamic_shared_pointer". Values
3783 Is "global_buffer" only "global"
3785 "dynamic_shared_pointer" always
3786 "local"? Can HCC allow "generic"?
3787 How can "private" or "region"
3790 ".access" string Kernel argument access
3791 qualifier. Only present if
3792 ".value_kind" is "image" or
3805 ".actual_access" string The actual memory accesses
3806 performed by the kernel on the
3807 kernel argument. Only present if
3808 ".value_kind" is "global_buffer",
3809 "image", or "pipe". This may be
3810 more restrictive than indicated
3811 by ".access" to reflect what the
kernel actually does. If not
3813 present then the runtime must
3814 assume what is implied by
3815 ".access" and ".is_const" . Values
3822 ".is_const" boolean Indicates if the kernel argument
3823 is const qualified. Only present
3827 ".is_restrict" boolean Indicates if the kernel argument
3828 is restrict qualified. Only
3829 present if ".value_kind" is
3832 ".is_volatile" boolean Indicates if the kernel argument
3833 is volatile qualified. Only
3834 present if ".value_kind" is
3837 ".is_pipe" boolean Indicates if the kernel argument
3838 is pipe qualified. Only present
3839 if ".value_kind" is "pipe".
3843 Can "global_buffer" be pipe
3846 ====================== ============== ========= ================================
3848 .. _amdgpu-amdhsa-code-object-metadata-v4:
3850 Code Object V4 Metadata
3851 +++++++++++++++++++++++
3853 Code object V4 metadata is the same as
3854 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3855 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3857 .. table:: AMDHSA Code Object V4 Metadata Map Changes
3858 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3860 ================= ============== ========= =======================================
3861 String Key Value Type Required? Description
3862 ================= ============== ========= =======================================
3863 "amdhsa.version" sequence of Required - The first integer is the major
3864 2 integers version. Currently 1.
3865 - The second integer is the minor
3866 version. Currently 1.
3867 "amdhsa.target" string Required The target name of the code using the syntax:
3871 <target-triple> [ "-" <target-id> ]
3873 A canonical target ID must be
3874 used. See :ref:`amdgpu-target-triples`
3875 and :ref:`amdgpu-target-id`.
3876 ================= ============== ========= =======================================
3878 .. _amdgpu-amdhsa-code-object-metadata-v5:
3880 Code Object V5 Metadata
3881 +++++++++++++++++++++++
Code object V5 is not the default code object version emitted by this version
of LLVM.
3888 Code object V5 metadata is the same as
3889 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3890 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
3891 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
3892 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3894 .. table:: AMDHSA Code Object V5 Metadata Map Changes
3895 :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3897 ================= ============== ========= =======================================
3898 String Key Value Type Required? Description
3899 ================= ============== ========= =======================================
3900 "amdhsa.version" sequence of Required - The first integer is the major
3901 2 integers version. Currently 1.
3902 - The second integer is the minor
3903 version. Currently 2.
3904 ================= ============== ========= =======================================
3908 .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
3909 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
3911 ============================= ============= ========== =======================================
3912 String Key Value Type Required? Description
3913 ============================= ============= ========== =======================================
3914 ".uses_dynamic_stack" boolean Indicates if the generated machine code
3915 is using a dynamically sized stack.
3916 ".workgroup_processor_mode" boolean (GFX10+) Controls ENABLE_WGP_MODE in
3917 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3918 ============================= ============= ========== =======================================
3922 .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
3923 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
3925 =========================== ============== ========= ==============================
3926 String Key Value Type Required? Description
3927 =========================== ============== ========= ==============================
3928 ".uniform_work_group_size" integer Indicates if the kernel
3929 requires that each dimension
3930 of global size is a multiple
3931 of corresponding dimension of
3932 work-group size. Value of 1
3933 implies true and value of 0
3934 implies false. Metadata is
3935 only emitted when value is 1.
3936 =========================== ============== ========= ==============================
3942 .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3943 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3945 ====================== ============== ========= ================================
3946 String Key Value Type Required? Description
3947 ====================== ============== ========= ================================
3948 ".value_kind" string Required Kernel argument kind that
3949 specifies how to set up the
3950 corresponding argument.
3952 the same as code object V3 metadata
3953 (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
3954 with the following additions:
3956 "hidden_block_count_x"
3957 The grid dispatch work-group count for the X dimension
3958 is passed in the kernarg. Some languages, such as OpenCL,
3959 support a last work-group in each dimension being partial.
3960 This count only includes the non-partial work-group count.
3961 This is not the same as the value in the AQL dispatch packet,
3962 which has the grid size in work-items.
3964 "hidden_block_count_y"
3965 The grid dispatch work-group count for the Y dimension
3966 is passed in the kernarg. Some languages, such as OpenCL,
3967 support a last work-group in each dimension being partial.
3968 This count only includes the non-partial work-group count.
3969 This is not the same as the value in the AQL dispatch packet,
3970 which has the grid size in work-items. If the grid dimensionality
3971 is 1, then must be 1.
3973 "hidden_block_count_z"
3974 The grid dispatch work-group count for the Z dimension
3975 is passed in the kernarg. Some languages, such as OpenCL,
3976 support a last work-group in each dimension being partial.
3977 This count only includes the non-partial work-group count.
3978 This is not the same as the value in the AQL dispatch packet,
3979 which has the grid size in work-items. If the grid dimensionality
3980 is 1 or 2, then must be 1.
3982 "hidden_group_size_x"
3983 The grid dispatch work-group size for the X dimension is
3984 passed in the kernarg. This size only applies to the
3985 non-partial work-groups. This is the same value as the AQL
3986 dispatch packet work-group size.
3988 "hidden_group_size_y"
3989 The grid dispatch work-group size for the Y dimension is
3990 passed in the kernarg. This size only applies to the
3991 non-partial work-groups. This is the same value as the AQL
3992 dispatch packet work-group size. If the grid dimensionality
3993 is 1, then must be 1.
3995 "hidden_group_size_z"
3996 The grid dispatch work-group size for the Z dimension is
3997 passed in the kernarg. This size only applies to the
3998 non-partial work-groups. This is the same value as the AQL
3999 dispatch packet work-group size. If the grid dimensionality
4000 is 1 or 2, then must be 1.
4002 "hidden_remainder_x"
4003 The grid dispatch work group size of the partial work group
4004 of the X dimension, if it exists. Must be zero if a partial
4005 work group does not exist in the X dimension.
4007 "hidden_remainder_y"
4008 The grid dispatch work group size of the partial work group
4009 of the Y dimension, if it exists. Must be zero if a partial
4010 work group does not exist in the Y dimension.
4012 "hidden_remainder_z"
4013 The grid dispatch work group size of the partial work group
4014 of the Z dimension, if it exists. Must be zero if a partial
4015 work group does not exist in the Z dimension.
4018 The grid dispatch dimensionality. This is the same value
as the AQL dispatch packet dimensionality. Must be a value between 1 and 3.
4023 A global address space pointer to an initialized memory
4024 buffer that conforms to the requirements of the malloc/free
4025 device library V1 version implementation.
4027 "hidden_private_base"
4028 The high 32 bits of the flat addressing private aperture base.
4029 Only used by GFX8 to allow conversion between private segment
4030 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4032 "hidden_shared_base"
4033 The high 32 bits of the flat addressing shared aperture base.
4034 Only used by GFX8 to allow conversion between shared segment
4035 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4038 A global memory address space pointer to the ROCm runtime
4039 ``struct amd_queue_t`` structure for the HSA queue of the
4040 associated dispatch AQL packet. It is only required for pre-GFX9
4041 devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
4043 ====================== ============== ========= ================================
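To illustrate how a language runtime might populate the hidden block count,
group size, and remainder arguments described in the table above, the
following is a minimal per-dimension sketch. The function name and return
shape are hypothetical and not part of the metadata specification.

.. code:: python

   def hidden_dispatch_fields(grid_size: int, workgroup_size: int):
       """Derive one dimension's hidden kernarg values from the grid size
       in work-items and the requested work-group size."""
       # Count of full (non-partial) work-groups only.
       block_count = grid_size // workgroup_size
       # Size of the trailing partial work-group, zero if none exists.
       remainder = grid_size % workgroup_size
       # The group size is the non-partial work-group size, as in the AQL
       # dispatch packet.
       return {"hidden_block_count": block_count,
               "hidden_group_size": workgroup_size,
               "hidden_remainder": remainder}

   # A grid of 1000 work-items with work-groups of 256 gives 3 full
   # work-groups and a partial work-group of 232 work-items.
   print(hidden_dispatch_fields(1000, 256))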
4050 The HSA architected queuing language (AQL) defines a user space memory interface
4051 that can be used to control the dispatch of kernels, in an agent independent
4052 way. An agent can have zero or more AQL queues created for it using an HSA
4053 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
4054 are 64 bytes) can be placed. See the *HSA Platform System Architecture
4055 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
4057 The packet processor of a kernel agent is responsible for detecting and
4058 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
4059 packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC), and shader processor input controller
(SPI).
4063 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
4064 the kernel mode driver to initialize and register the AQL queue with CP.
4066 To dispatch a kernel the following actions are performed. This can occur in the
4067 CPU host program, or from an HSA kernel executing on a GPU.
4069 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
4070 executed is obtained.
4071 2. A pointer to the kernel descriptor (see
4072 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
4073 It must be for a kernel that is contained in a code object that was loaded
4074 by an HSA compatible runtime on the kernel agent with which the AQL queue is
4076 3. Space is allocated for the kernel arguments using the HSA compatible runtime
4077 allocator for a memory region with the kernarg property for the kernel agent
4078 that will execute the kernel. It must be at least 16-byte aligned.
4079 4. Kernel argument values are assigned to the kernel argument memory
4080 allocation. The layout is defined in the *HSA Programmer's Language
4081 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
4082 kernel argument memory in the same way constant memory is accessed. (Note
4083 that the HSA specification allows an implementation to copy the kernel
4084 argument contents to another location that is accessed by the kernel.)
4085 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
runtime API uses 64-bit atomic operations to reserve space in the AQL queue
4087 for the packet. The packet must be set up, and the final write must use an
4088 atomic store release to set the packet kind to ensure the packet contents are
4089 visible to the kernel agent. AQL defines a doorbell signal mechanism to
4090 notify the kernel agent that the AQL queue has been updated. These rules, and
4091 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
4092 System Architecture Specification* [HSA]_.
4093 6. A kernel dispatch packet includes information about the actual dispatch,
4094 such as grid and work-group size, together with information from the code
4095 object about the kernel, such as segment sizes. The HSA compatible runtime
4096 queries on the kernel symbol can be used to obtain the code object values
4097 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
4098 7. CP executes micro-code and is responsible for detecting and setting up the
4099 GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
4101 code, the scalar general purpose registers (SGPR) and vector general purpose
4102 registers (VGPR) are set up as required by the machine code. The required
4103 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
4104 register state is defined in
4105 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4106 9. The prolog of the kernel machine code (see
4107 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
4108 before continuing executing the machine code that corresponds to the kernel.
4109 10. When the kernel dispatch has completed execution, CP signals the completion
4110 signal specified in the kernel dispatch packet if not 0.
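The 64-byte kernel dispatch packet layout is defined by the *HSA Platform
System Architecture Specification* [HSA]_ rather than by this document; the
following sketch merely illustrates step 5 above by packing the commonly used
fields with Python's ``struct`` module. The helper name and the placeholder
values are assumptions for illustration only.

.. code:: python

   import struct

   def pack_kernel_dispatch_packet(header, setup, wg, grid,
                                   private_size, group_size,
                                   kernel_object, kernarg_address,
                                   completion_signal):
       """Pack a 64-byte HSA kernel dispatch packet (little endian)."""
       return struct.pack(
           "<HH3HH3IIIQQQQ",
           header, setup,
           wg[0], wg[1], wg[2], 0,          # work-group size X/Y/Z, reserved
           grid[0], grid[1], grid[2],       # grid size X/Y/Z in work-items
           private_size, group_size,        # per work-item / work-group bytes
           kernel_object,                   # address of the kernel descriptor
           kernarg_address,                 # kernarg segment allocation
           0,                               # reserved
           completion_signal)               # signal handle, 0 if unused

   assert len(pack_kernel_dispatch_packet(0, 0, (256, 1, 1), (1024, 1, 1),
                                          0, 0, 0, 0, 0)) == 64

The final write of the packet's header field must still be performed with an
atomic store release, as described in step 5.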
4112 .. _amdgpu-amdhsa-memory-spaces:
4117 The memory space properties are:
4119 .. table:: AMDHSA Memory Spaces
4120 :name: amdgpu-amdhsa-memory-spaces-table
================= =========== ======== ======= ==================
Memory Space Name HSA Segment Hardware Address NULL Value
                  Name        Name     Size
================= =========== ======== ======= ==================
Private           private     scratch  32      0x00000000
Local             group       LDS      32      0xFFFFFFFF
Global            global      global   64      0x0000000000000000
Constant          constant    *same as 64      0x0000000000000000
                              global*
Generic           flat        flat     64      0x0000000000000000
Region            N/A         GDS      32      *not implemented
                                               for AMDHSA*
================= =========== ======== ======= ==================
4136 The global and constant memory spaces both use global virtual addresses, which
4137 are the same virtual address space used by the CPU. However, some virtual
4138 addresses may only be accessible to the CPU, some only accessible by the GPU,
4141 Using the constant memory space indicates that the data will not change during
4142 the execution of the kernel. This allows scalar read instructions to be
4143 used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatch executions.
4147 The local memory space uses the hardware Local Data Store (LDS) which is
4148 automatically allocated when the hardware creates work-groups of wavefronts, and
4149 freed when all the wavefronts of a work-group have terminated. The data store
4150 (DS) instructions can be used to access it.
4152 The private memory space uses the hardware scratch memory support. If the kernel
4153 uses scratch, then the hardware allocates memory that is accessed using
4154 wavefront lane dword (4 byte) interleaving. The mapping used from private
4155 address to physical address is:
4157 ``wavefront-scratch-base +
4158 (private-address * wavefront-size * 4) +
4159 (wavefront-lane-id * 4)``
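The mapping above can be restated as a small helper; this is a sketch only,
with hypothetical parameter names.

.. code:: python

   def private_to_physical(wavefront_scratch_base, private_address,
                           wavefront_size, lane_id):
       """Apply the dword-interleaved private (scratch) address mapping."""
       return (wavefront_scratch_base
               + private_address * wavefront_size * 4
               + lane_id * 4)

   # Lanes 0 and 1 of a wave64 accessing the same private address land in
   # adjacent dwords, which is what makes uniform accesses cache friendly.
   assert private_to_physical(0x1000, 2, 64, 1) - \
          private_to_physical(0x1000, 2, 64, 0) == 4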
4161 There are different ways that the wavefront scratch base address is determined
4162 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
4163 memory can be accessed in an interleaved manner using buffer instruction with
4164 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
4165 instructions, or by flat instructions. If each lane of a wavefront accesses the
4166 same private address, the interleaving results in adjacent dwords being accessed
4167 and hence requires fewer cache lines to be fetched. Multi-dword access is not
4168 supported except by flat and scratch instructions in GFX9-GFX11.
4170 The generic address space uses the hardware flat address support available in
4171 GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
local apertures) that are outside the range of addressable global memory, to
4173 map from a flat address to a private or local address.
4175 FLAT instructions can take a flat address and access global, private (scratch)
4176 and group (LDS) memory depending on if the address is within one of the
4177 aperture ranges. Flat access to scratch requires hardware aperture setup and
4178 setup in the kernel prologue (see
4179 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
4180 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
4181 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
To convert between a segment address and a flat address, the base address of
the apertures can be used. For GFX7-GFX8 these are available in the
4185 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
4186 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
4187 GFX9-GFX11 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32,
which makes it easier to convert from flat to segment or segment to flat.
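For GFX9-GFX11, where the aperture base is 2^32 aligned and the aperture size
is 2^32 bytes, the conversion reduces to simple arithmetic. The following
sketch assumes a hypothetical ``aperture_base`` value already obtained from
the corresponding inline constant register; it is illustrative only.

.. code:: python

   APERTURE_SIZE = 1 << 32  # 2^32 bytes, base aligned to 2^32

   def segment_to_flat(aperture_base, segment_address):
       """Form a flat address from a 32-bit private or local segment address."""
       return aperture_base + segment_address

   def flat_to_segment(aperture_base, flat_address):
       """Recover the 32-bit segment offset if the flat address is in range."""
       if aperture_base <= flat_address < aperture_base + APERTURE_SIZE:
           return flat_address - aperture_base
       return None  # not an address within this aperture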
4195 Image and sample handles created by an HSA compatible runtime (see
:ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
4197 object respectively. In order to support the HSA ``query_sampler`` operations
4198 two extra dwords are used to store the HSA BRIG enumeration values for the
4199 queries that are not trivially deducible from the S# representation.
4204 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
4205 are 64-bit addresses of a structure allocated in memory accessible from both the
4206 CPU and GPU. The structure is defined by the runtime and subject to change
4207 between releases. For example, see [AMD-ROCm-github]_.
4209 .. _amdgpu-amdhsa-hsa-aql-queue:
4214 The HSA AQL queue structure is defined by an HSA compatible runtime (see
4215 :ref:`amdgpu-os`) and subject to change between releases. For example, see
4216 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
4217 certain language features such as the flat address aperture bases. It also
4218 contains fields used by CP such as managing the allocation of scratch memory.
4220 .. _amdgpu-amdhsa-kernel-descriptor:
4225 A kernel descriptor consists of the information needed by CP to initiate the
4226 execution of a kernel, including the entry point address of the machine code
4227 that implements the kernel.
4229 Code Object V3 Kernel Descriptor
4230 ++++++++++++++++++++++++++++++++
CP microcode requires the kernel descriptor to be allocated on 64-byte
alignment.
4235 The fields used by CP for code objects before V3 also match those specified in
4236 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
4238 .. table:: Code Object V3 Kernel Descriptor
4239 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
4241 ======= ======= =============================== ============================
4242 Bits Size Field Name Description
4243 ======= ======= =============================== ============================
4244 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
4245 address space memory
4246 required for a work-group
4247 in bytes. This does not
4248 include any dynamically
4249 allocated local address
4250 space memory that may be
4251 added when the kernel is
4253 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
4254 private address space
4255 memory required for a
4256 work-item in bytes. When
4257 this cannot be predicted,
4258 code object v4 and older
4259 sets this value to be
4260 higher than the minimum
4262 95:64 4 bytes KERNARG_SIZE The size of the kernarg
4263 memory pointed to by the
4264 AQL dispatch packet. The
4265 kernarg memory is used to
4266 pass arguments to the
4269 * If the kernarg pointer in
4270 the dispatch packet is NULL
4271 then there are no kernel
4273 * If the kernarg pointer in
4274 the dispatch packet is
4275 not NULL and this value
4276 is 0 then the kernarg
4279 * If the kernarg pointer in
4280 the dispatch packet is
4281 not NULL and this value
4282 is not 0 then the value
4283 specifies the kernarg
4284 memory size in bytes. It
4285 is recommended to provide
4286 a value as it may be used
4287 by CP to optimize making
4289 visible to the kernel
4292 127:96 4 bytes Reserved, must be 0.
4293 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
4296 descriptor to kernel's
4297 entry point instruction
4298 which must be 256 byte
351:192 20 bytes Reserved, must be 0.
4302 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
4303 Reserved, must be 0.
4306 program settings used by
4308 ``COMPUTE_PGM_RSRC3``
4311 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
4314 program settings used by
4316 ``COMPUTE_PGM_RSRC3``
4319 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
4320 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
4321 program settings used by
4323 ``COMPUTE_PGM_RSRC1``
4326 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
4327 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
4328 program settings used by
4330 ``COMPUTE_PGM_RSRC2``
4333 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
4334 458:448 7 bits *See separate bits below.* Enable the setup of the
4335 SGPR user data registers
4337 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4339 The total number of SGPR
4341 requested must not exceed
4342 16 and match value in
4343 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
4344 Any requests beyond 16
4346 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
4348 :ref:`amdgpu-processor-table`
4349 specifies *Architected flat
4350 scratch* then not supported
4352 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
4353 >450 1 bit ENABLE_SGPR_QUEUE_PTR
4354 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
4355 >452 1 bit ENABLE_SGPR_DISPATCH_ID
4356 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
4358 :ref:`amdgpu-processor-table`
4359 specifies *Architected flat
4360 scratch* then not supported
4362 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
4364 457:455 3 bits Reserved, must be 0.
4365 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
4366 Reserved, must be 0.
4369 wavefront size 64 mode.
4371 native wavefront size
4373 459 1 bit USES_DYNAMIC_STACK Indicates if the generated
4374 machine code is using a
4375 dynamically sized stack.
4376 This is only set in code
4377 object v5 and later.
4378 463:460 4 bits Reserved, must be 0.
4379 470:464 7 bits KERNARG_PRELOAD_SPEC_LENGTH GFX6-GFX9
4380 - Reserved, must be 0.
4382 - The number of dwords from
4383 the kernarg segment to preload
4384 into User SGPRs before kernel
4386 :ref:`amdgpu-amdhsa-kernarg-preload`).
4387 479:471 9 bits KERNARG_PRELOAD_SPEC_OFFSET GFX6-GFX9
4388 - Reserved, must be 0.
4390 - An offset in dwords into the
4391 kernarg segment to begin
4392 preloading data into User
4394 :ref:`amdgpu-amdhsa-kernarg-preload`).
4395 511:480 4 bytes Reserved, must be 0.
4396 512 **Total size 64 bytes.**
4397 ======= ====================================================================
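Summarizing the table above, the 64-byte layout can be sanity checked with a
small sketch that packs only the sizes and offsets listed. The field values
shown are placeholders and the helper is not an official definition of the
kernel descriptor.

.. code:: python

   import struct

   # <IIII : group/private segment fixed size, kernarg size, 4-byte reserved
   # q     : kernel code entry byte offset (possibly negative, 64-bit)
   # 20x   : reserved bytes
   # III   : COMPUTE_PGM_RSRC3, RSRC1, RSRC2
   # HH    : kernel code property bits, kernarg preload spec
   # 4x    : final reserved bytes
   KERNEL_DESCRIPTOR = struct.Struct("<IIIIq20xIIIHH4x")
   assert KERNEL_DESCRIPTOR.size == 64

   descriptor = KERNEL_DESCRIPTOR.pack(
       0x0,      # GROUP_SEGMENT_FIXED_SIZE
       0x0,      # PRIVATE_SEGMENT_FIXED_SIZE
       0x0,      # KERNARG_SIZE
       0,        # reserved
       0x100,    # KERNEL_CODE_ENTRY_BYTE_OFFSET (256-byte aligned)
       0, 0, 0,  # COMPUTE_PGM_RSRC3, RSRC1, RSRC2
       0, 0)     # kernel code property bits, kernarg preload spec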
4401 .. table:: compute_pgm_rsrc1 for GFX6-GFX11
4402 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table
4404 ======= ======= =============================== ===========================================================================
4405 Bits Size Field Name Description
4406 ======= ======= =============================== ===========================================================================
4407 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
4408 blocks used by each work-item;
4409 granularity is device
4414 - max(0, ceil(vgprs_used / 4) - 1)
4417 - vgprs_used = align(arch_vgprs, 4)
4419 - max(0, ceil(vgprs_used / 8) - 1)
4420 GFX10-GFX11 (wavefront size 64)
4422 - max(0, ceil(vgprs_used / 4) - 1)
4423 GFX10-GFX11 (wavefront size 32)
4425 - max(0, ceil(vgprs_used / 8) - 1)
4427 Where vgprs_used is defined
4428 as the highest VGPR number
4429 explicitly referenced plus
4432 Used by CP to set up
4433 ``COMPUTE_PGM_RSRC1.VGPRS``.
4436 :ref:`amdgpu-assembler`
4438 automatically for the
4439 selected processor from
4440 values provided to the
4441 `.amdhsa_kernel` directive
4443 `.amdhsa_next_free_vgpr`
4444 nested directive (see
4445 :ref:`amdhsa-kernel-directives-table`).
4446 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4447 blocks used by a wavefront;
4448 granularity is device
4453 - max(0, ceil(sgprs_used / 8) - 1)
4456 - 2 * max(0, ceil(sgprs_used / 16) - 1)
4458 Reserved, must be 0.
4463 defined as the highest
4464 SGPR number explicitly
4465 referenced plus one, plus
4466 a target specific number
4467 of additional special
4469 FLAT_SCRATCH (GFX7+) and
4470 XNACK_MASK (GFX8+), and
4473 limitations. It does not
4474 include the 16 SGPRs added
4475 if a trap handler is
4479 limitations and special
4480 SGPR layout are defined in
4482 documentation, which can
4484 :ref:`amdgpu-processors`
4487 Used by CP to set up
4488 ``COMPUTE_PGM_RSRC1.SGPRS``.
4491 :ref:`amdgpu-assembler`
4493 automatically for the
4494 selected processor from
4495 values provided to the
4496 `.amdhsa_kernel` directive
4498 `.amdhsa_next_free_sgpr`
4499 and `.amdhsa_reserve_*`
4500 nested directives (see
4501 :ref:`amdhsa-kernel-directives-table`).
4502 11:10 2 bits PRIORITY Must be 0.
4504 Start executing wavefront
4505 at the specified priority.
4507 CP is responsible for
4509 ``COMPUTE_PGM_RSRC1.PRIORITY``.
4510 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
4511 with specified rounding
4514 precision floating point
4517 Floating point rounding
4518 mode values are defined in
4519 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4521 Used by CP to set up
4522 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4523 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
4524 with specified rounding
4525 denorm mode for half/double (16
4526 and 64-bit) floating point
4527 precision floating point
4530 Floating point rounding
4531 mode values are defined in
4532 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4534 Used by CP to set up
4535 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4536 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
4537 with specified denorm mode
4540 precision floating point
4543 Floating point denorm mode
4544 values are defined in
4545 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4547 Used by CP to set up
4548 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4549 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
4550 with specified denorm mode
4552 and 64-bit) floating point
4553 precision floating point
4556 Floating point denorm mode
4557 values are defined in
4558 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4560 Used by CP to set up
4561 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4562 20 1 bit PRIV Must be 0.
4564 Start executing wavefront
4565 in privilege trap handler
4568 CP is responsible for
4570 ``COMPUTE_PGM_RSRC1.PRIV``.
4571 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
4572 with DX10 clamp mode
4573 enabled. Used by the vector
4574 ALU to force DX10 style
4575 treatment of NaN's (when
4576 set, clamp NaN to zero,
4580 Used by CP to set up
4581 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4582 22 1 bit DEBUG_MODE Must be 0.
4584 Start executing wavefront
4585 in single step mode.
4587 CP is responsible for
4589 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4590 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
4592 enabled. Floating point
4593 opcodes that support
4594 exception flag gathering
4595 will quiet and propagate
4596 signaling-NaN inputs per
4597 IEEE 754-2008. Min_dx10 and
4598 max_dx10 become IEEE
4599 754-2008 compliant due to
4600 signaling-NaN propagation
4603 Used by CP to set up
4604 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4605 24 1 bit BULKY Must be 0.
4607 Only one work-group allowed
4608 to execute on a compute
4611 CP is responsible for
4613 ``COMPUTE_PGM_RSRC1.BULKY``.
4614 25 1 bit CDBG_USER Must be 0.
4616 Flag that can be used to
4617 control debugging code.
4619 CP is responsible for
4621 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4622 26 1 bit FP16_OVFL GFX6-GFX8
4623 Reserved, must be 0.
4625 Wavefront starts execution
4626 with specified fp16 overflow
4629 - If 0, fp16 overflow generates
4631 - If 1, fp16 overflow that is the
4632 result of an +/-INF input value
4633 or divide by 0 produces a +/-INF,
4634 otherwise clamps computed
overflow to +/-MAX_FP16 as appropriate.
4638 Used by CP to set up
4639 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4640 28:27 2 bits Reserved, must be 0.
4641 29 1 bit WGP_MODE GFX6-GFX9
4642 Reserved, must be 0.
4644 - If 0 execute work-groups in
4645 CU wavefront execution mode.
- If 1 execute work-groups in
WGP wavefront execution mode.
4649 See :ref:`amdgpu-amdhsa-memory-model`.
4651 Used by CP to set up
4652 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4653 30 1 bit MEM_ORDERED GFX6-GFX9
4654 Reserved, must be 0.
4656 Controls the behavior of the
4657 s_waitcnt's vmcnt and vscnt
4660 - If 0 vmcnt reports completion
4661 of load and atomic with return
4662 out of order with sample
4663 instructions, and the vscnt
4664 reports the completion of
4665 store and atomic without
4667 - If 1 vmcnt reports completion
4668 of load, atomic with return
4669 and sample instructions in
4670 order, and the vscnt reports
4671 the completion of store and
4672 atomic without return in order.
4674 Used by CP to set up
4675 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4676 31 1 bit FWD_PROGRESS GFX6-GFX9
4677 Reserved, must be 0.
4679 - If 0 execute SIMD wavefronts
4680 using oldest first policy.
4681 - If 1 execute SIMD wavefronts to
4682 ensure wavefronts will make some
4685 Used by CP to set up
4686 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4687 32 **Total size 4 bytes**
4688 ======= ===================================================================================================================
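The granulated register counts in the table above are simple functions of the
number of registers used. The following sketch mirrors those formulas with the
granule passed in explicitly; the helper name is hypothetical, and the granule
and any scaling factor must be chosen from the table for the actual target.

.. code:: python

   from math import ceil

   def granulated(register_count: int, granule: int) -> int:
       """max(0, ceil(count / granule) - 1), the encoding used by both the
       VGPR and SGPR fields; the granule is target specific (see the table)."""
       return max(0, ceil(register_count / granule) - 1)

   # Examples using granules that appear in the formulas above:
   vgpr_blocks = granulated(41, 4)        # ceil(41/4) - 1 = 10
   sgpr_blocks = 2 * granulated(30, 16)   # one SGPR formula doubles the value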
4692 .. table:: compute_pgm_rsrc2 for GFX6-GFX11
4693 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table
4695 ======= ======= =============================== ===========================================================================
4696 Bits Size Field Name Description
4697 ======= ======= =============================== ===========================================================================
4698 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
4700 * If the *Target Properties*
4702 :ref:`amdgpu-processor-table`
4705 scratch* then enable the
4707 wavefront scratch offset
4708 system register (see
4709 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4710 * If the *Target Properties*
4712 :ref:`amdgpu-processor-table`
4713 specifies *Architected
4714 flat scratch* then enable
4716 FLAT_SCRATCH register
4718 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4720 Used by CP to set up
4721 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4722 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
4724 registers requested. This
4725 number must be greater than
4726 or equal to the number of user
4727 data registers enabled.
4729 Used by CP to set up
4730 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4731 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
4734 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4735 which is set by the CP if
4736 the runtime has installed a
4738 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
4739 system SGPR register for
4740 the work-group id in the X
4742 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4744 Used by CP to set up
4745 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4746 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
4747 system SGPR register for
4748 the work-group id in the Y
4750 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4752 Used by CP to set up
4753 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4754 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
4755 system SGPR register for
4756 the work-group id in the Z
4758 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4760 Used by CP to set up
4761 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4762 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
4763 system SGPR register for
4764 work-group information (see
4765 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4767 Used by CP to set up
4768 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4769 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
4770 VGPR system registers used
4771 for the work-item ID.
4772 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4775 Used by CP to set up
4776 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4777 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
4779 Wavefront starts execution
4781 exceptions enabled which
4782 are generated when L1 has
4783 witnessed a thread access
4787 CP is responsible for
4788 filling in the address
4790 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4791 according to what the
4793 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
4795 Wavefront starts execution
with memory violation
exceptions enabled which are generated
4799 when a memory violation has
occurred for this wavefront from L1 or LDS
4802 (write-to-read-only-memory,
4803 mis-aligned atomic, LDS
4804 address out of range,
4805 illegal address, etc.).
4809 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4810 according to what the
4812 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
4814 CP uses the rounded value
4815 from the dispatch packet,
4816 not this value, as the
4817 dispatch may contain
4818 dynamically allocated group
4819 segment memory. CP writes
4821 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4823 Amount of group segment
4824 (LDS) to allocate for each
4825 work-group. Granularity is
4829 roundup(lds-size / (64 * 4))
4831 roundup(lds-size / (128 * 4))
4833 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
4834 _INVALID_OPERATION with specified exceptions
4837 Used by CP to set up
4838 ``COMPUTE_PGM_RSRC2.EXCP_EN``
4839 (set from bits 0..6).
4843 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
4844 _SOURCE input operands is a
4846 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
4847 _DIVISION_BY_ZERO Zero
27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
4850 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
4852 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
4854 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
_ZERO (rcp_iflag_f32 instruction only).
4857 31 1 bit Reserved, must be 0.
4858 32 **Total size 4 bytes.**
4859 ======= ===================================================================================================================
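Although GRANULATED_LDS_SIZE must be zero in the kernel descriptor, CP derives
the equivalent value from the dispatch packet using the granularity formulas
shown above. A sketch with the granule (in dwords) passed in as a parameter,
since the granularity is device specific:

.. code:: python

   from math import ceil

   def granulated_lds_size(lds_size_bytes: int, granule_dwords: int) -> int:
       """roundup(lds-size / (granule_dwords * 4)), per the table above."""
       return ceil(lds_size_bytes / (granule_dwords * 4))

   # 1 KiB of group segment with a 128-dword granule occupies 2 granules.
   assert granulated_lds_size(1024, 128) == 2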
4863 .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4864 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4866 ======= ======= =============================== ===========================================================================
4867 Bits Size Field Name Description
4868 ======= ======= =============================== ===========================================================================
4869 5:0 6 bits ACCUM_OFFSET Offset of a first AccVGPR in the unified register file. Granularity 4.
4870 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4871 63 - accum-offset = 256.
4872 6:15 10 Reserved, must be 0.
4874 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
4875 launched in the same CU.
4876 - If 1 the waves of a work-group can be
4877 launched in different CUs. The waves
4878 cannot use S_BARRIER or LDS.
4879 17:31 15 Reserved, must be 0.
4881 32 **Total size 4 bytes.**
4882 ======= ===================================================================================================================
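The ACCUM_OFFSET encoding above maps a field value ``v`` to an AccVGPR offset
of ``4 * (v + 1)``. A tiny sketch of the inverse encoding (the helper name is
hypothetical):

.. code:: python

   def encode_accum_offset(accum_offset: int) -> int:
       """Encode an AccVGPR offset (a multiple of 4 in 4..256) as ACCUM_OFFSET."""
       assert accum_offset % 4 == 0 and 4 <= accum_offset <= 256
       return accum_offset // 4 - 1

   assert encode_accum_offset(4) == 0 and encode_accum_offset(256) == 63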
4886 .. table:: compute_pgm_rsrc3 for GFX10-GFX11
4887 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
4889 ======= ======= =============================== ===========================================================================
4890 Bits Size Field Name Description
4891 ======= ======= =============================== ===========================================================================
4892 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For
4893 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4894 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4895 not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4896 9:4 6 bits INST_PREF_SIZE GFX10
4897 Reserved, must be 0.
4899 Number of instruction bytes to prefetch, starting at the kernel's entry
4900 point instruction, before wavefront starts execution. The value is 0..63
4901 with a granularity of 128 bytes.
4902 10 1 bit TRAP_ON_START GFX10
4903 Reserved, must be 0.
4907 If 1, wavefront starts execution by trapping into the trap handler.
4909 CP is responsible for filling in the trap on start bit in
4910 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
4912 11 1 bit TRAP_ON_END GFX10
4913 Reserved, must be 0.
4917 If 1, wavefront execution terminates by trapping into the trap handler.
4919 CP is responsible for filling in the trap on end bit in
4920 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
4921 30:12 19 bits Reserved, must be 0.
31      1 bit   IMAGE_OP            GFX10
                                      Reserved, must be 0.
                                    GFX11
                                      If 1, the kernel execution contains image instructions. If executed as
                                      part of a graphics pipeline, image read instructions will stall waiting
                                      for any necessary ``WAIT_SYNC`` fence to be performed in order to
                                      indicate that earlier pipeline stages have completed writing to the
                                      image.

                                      Not used for compute kernels that are not part of a graphics pipeline and
                                      must be 0.
4933 32 **Total size 4 bytes.**
4934 ======= ===================================================================================================================
4938 .. table:: Floating Point Rounding Mode Enumeration Values
4939 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4941 ====================================== ===== ==============================
4942 Enumeration Name Value Description
4943 ====================================== ===== ==============================
4944 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
4945 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
4946 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
4947 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
4948 ====================================== ===== ==============================
4951 .. table:: Extended FLT_ROUNDS Enumeration Values
4952 :name: amdgpu-rounding-mode-enumeration-values-table
4954 +------------------------+---------------+-------------------+--------------------+----------+
4955 | | F32 NEAR_EVEN | F32 PLUS_INFINITY | F32 MINUS_INFINITY | F32 ZERO |
4956 +------------------------+---------------+-------------------+--------------------+----------+
4957 | F64/F16 NEAR_EVEN | 1 | 11 | 14 | 17 |
4958 +------------------------+---------------+-------------------+--------------------+----------+
4959 | F64/F16 PLUS_INFINITY | 8 | 2 | 15 | 18 |
4960 +------------------------+---------------+-------------------+--------------------+----------+
4961 | F64/F16 MINUS_INFINITY | 9 | 12 | 3 | 19 |
4962 +------------------------+---------------+-------------------+--------------------+----------+
4963 | F64/F16 ZERO | 10 | 13 | 16 | 0 |
4964 +------------------------+---------------+-------------------+--------------------+----------+
4968 .. table:: Floating Point Denorm Mode Enumeration Values
4969 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4971 ====================================== ===== ====================================
4972 Enumeration Name Value Description
4973 ====================================== ===== ====================================
4974 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination Denorms
4975 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
4976 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
4977 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
4978 ====================================== ===== ====================================
Denormal flushing is sign respecting, i.e. the behavior expected by
``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
``"denormal-fp-math"="positive-zero"``.
4986 .. table:: System VGPR Work-Item ID Enumeration Values
4987 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4989 ======================================== ===== ============================
4990 Enumeration Name Value Description
4991 ======================================== ===== ============================
4992 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
4994 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
4996 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
4998 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
4999 ======================================== ===== ============================
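In assembly source, the enumeration above is the value taken by the
``.amdhsa_system_vgpr_workitem_id`` directive; for example, a kernel that needs
all three work-item id VGPRs requests value 2. A fragment of a
``.amdhsa_kernel`` block (kernel name and register counts are hypothetical):

.. code-block:: none

  .amdhsa_kernel example_kernel
    .amdhsa_next_free_vgpr 8            ; hypothetical VGPR count
    .amdhsa_next_free_sgpr 16           ; hypothetical SGPR count
    .amdhsa_system_vgpr_workitem_id 2   ; SYSTEM_VGPR_WORKITEM_ID_X_Y_Z
  .end_amdhsa_kernel
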
5001 .. _amdgpu-amdhsa-initial-kernel-execution-state:
5003 Initial Kernel Execution State
5004 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5006 This section defines the register state that will be set up by the packet
5007 processor prior to the start of execution of every wavefront. This is limited by
5008 the constraints of the hardware controllers of CP/ADC/SPI.
5010 The order of the SGPR registers is defined, but the compiler can specify which
5011 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
5012 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5013 for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.
5017 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
5018 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
5019 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
5020 actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.
5024 SGPR register initial state is defined in
5025 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
5027 .. table:: SGPR Register Set Up Order
5028 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
5030 ========== ========================== ====== ==============================
5031 SGPR Order Name Number Description
5032 (kernel descriptor enable of
5034 ========== ========================== ====== ==============================
5035 First Private Segment Buffer 4 See
5036 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5038 then Dispatch Ptr 2 64-bit address of AQL dispatch
5039 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
5041 then Queue Ptr 2 64-bit address of amd_queue_t
5042 (enable_sgpr_queue_ptr) object for AQL queue on which
5043 the dispatch packet was
5045 then Kernarg Segment Ptr 2 64-bit address of Kernarg
5046 (enable_sgpr_kernarg segment. This is directly
5047 _segment_ptr) copied from the
5048 kernarg_address in the kernel
5051 Having CP load it once avoids
5052 loading it at the beginning of
5054 then Dispatch Id 2 64-bit Dispatch ID of the
5055 (enable_sgpr_dispatch_id) dispatch packet being
5057 then Flat Scratch Init 2 See
5058 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5060 then Preloaded Kernargs N/A See
5061 (kernarg_preload_spec :ref:`amdgpu-amdhsa-kernarg-preload`.
5063 then Private Segment Size 1 The 32-bit byte size of a
5064 (enable_sgpr_private single work-item's memory
5065 _segment_size) allocation. This is the
5066 value from the kernel
5067 dispatch packet Private
5068 Segment Byte Size rounded up
5069 by CP to a multiple of
5072 Having CP load it once avoids
5073 loading it at the beginning of
5076 This is not used for
5077 GFX7-GFX8 since it is the same
5078 value as the second SGPR of
5079 Flat Scratch Init. However, it
5080 may be needed for GFX9-GFX11 which
5081 changes the meaning of the
5082 Flat Scratch Init value.
5083 then Work-Group Id X 1 32-bit work-group id in X
5084 (enable_sgpr_workgroup_id dimension of grid for
5086 then Work-Group Id Y 1 32-bit work-group id in Y
5087 (enable_sgpr_workgroup_id dimension of grid for
5089 then Work-Group Id Z 1 32-bit work-group id in Z
5090 (enable_sgpr_workgroup_id dimension of grid for
5092 then Work-Group Info 1 {first_wavefront, 14'b0000,
5093 (enable_sgpr_workgroup ordered_append_term[10:0],
5094 _info) threadgroup_size_in_wavefronts[5:0]}
5095 then Scratch Wavefront Offset 1 See
5096 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5097 _segment_wavefront_offset) and
5098 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
5099 ========== ========================== ====== ==============================
5101 The order of the VGPR registers is defined, but the compiler can specify which
5102 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
5103 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
5104 for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.
5108 There are different methods used for the VGPR initial state:
5110 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
5111 specifies otherwise, a separate VGPR register is used per work-item ID. The
5112 VGPR register initial state for this method is defined in
5113 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
5114 * If *Target Properties* column of :ref:`amdgpu-processor-table`
5115 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
5116 for all work-item IDs. The register layout for this method is defined in
5117 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
5119 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
5120 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
5122 ========== ========================== ====== ==============================
5123 VGPR Order Name Number Description
5124 (kernel descriptor enable of
5126 ========== ========================== ====== ==============================
5127 First Work-Item Id X 1 32-bit work-item id in X
5128 (Always initialized) dimension of work-group for
5130 then Work-Item Id Y 1 32-bit work-item id in Y
5131 (enable_vgpr_workitem_id dimension of work-group for
5132 > 0) wavefront lane.
5133 then Work-Item Id Z 1 32-bit work-item id in Z
5134 (enable_vgpr_workitem_id dimension of work-group for
5135 > 1) wavefront lane.
5136 ========== ========================== ====== ==============================
5140 .. table:: Register Layout for Packed Work-Item ID Method
5141 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
5143 ======= ======= ================ =========================================
5144 Bits Size Field Name Description
5145 ======= ======= ================ =========================================
5146 0:9 10 bits Work-Item Id X Work-item id in X
5147 dimension of work-group for
5152 10:19 10 bits Work-Item Id Y Work-item id in Y
5153 dimension of work-group for
5156 Initialized if enable_vgpr_workitem_id >
5157 0, otherwise set to 0.
5158 20:29 10 bits Work-Item Id Z Work-item id in Z
5159 dimension of work-group for
5162 Initialized if enable_vgpr_workitem_id >
5163 1, otherwise set to 0.
5164 30:31 2 bits Reserved, set to 0.
5165 ======= ======= ================ =========================================
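For the packed method, the individual ids can be unpacked with bit-field
extract instructions. A minimal sketch, assuming the packed value is in v0 and
that v1/v2 are free (the register choices are hypothetical):

.. code-block:: none

  v_bfe_u32 v1, v0, 10, 10   ; work-item id Y = bits 19:10
  v_bfe_u32 v2, v0, 20, 10   ; work-item id Z = bits 29:20
  v_and_b32 v0, 0x3ff, v0    ; work-item id X = bits 9:0
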
5167 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
5171 2. Work-group Id registers X, Y, Z are set by ADC which supports any
5172 combination including none.
5173 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
5174 its value cannot be included with the flat scratch init value which is per
5175 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
5178 5. Flat Scratch register pair initialization is described in
5179 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
5181 The global segment can be accessed either using buffer instructions (GFX6 which
5182 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
5183 instructions (GFX9-GFX11).
5185 If buffer operations are used, then the compiler can generate a V# with the
5186 following properties:
5190 * ATC: 1 if IOMMU present (such as APU)
5192 * MTYPE set to support memory coherence that matches the runtime (such as CC for
5193 APU and NC for dGPU).
5195 .. _amdgpu-amdhsa-kernarg-preload:
5197 Preloaded Kernel Arguments
5198 ++++++++++++++++++++++++++
5200 On hardware that supports this feature, kernel arguments can be preloaded into
5201 User SGPRs, up to the maximum number of User SGPRs available. The allocation of
5202 Preload SGPRs occurs directly after the last enabled non-kernarg preload User
5203 SGPR. (See :ref:`amdgpu-amdhsa-initial-kernel-execution-state`)
The data preloaded is copied from the kernarg segment; the amount of data is
5206 determined by the value specified in the kernarg_preload_spec_length field of
5207 the kernel descriptor. This data is then loaded into consecutive User SGPRs. The
5208 number of SGPRs receiving preloaded kernarg data corresponds with the value
5209 given by kernarg_preload_spec_length. The preloading starts at the dword offset
5210 within the kernarg segment, which is specified by the
5211 kernarg_preload_spec_offset field.
5213 If the kernarg_preload_spec_length is non-zero, the CP firmware will append an
5214 additional 256 bytes to the kernel_code_entry_byte_offset. This addition
5215 facilitates the incorporation of a prologue to the kernel entry to handle cases
5216 where code designed for kernarg preloading is executed on hardware equipped with
5217 incompatible firmware. If hardware has compatible firmware the 256 bytes at the
5218 start of the kernel entry will be skipped.
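A conceptual sketch of such a backwards-compatibility prologue is shown below;
it is not the exact code the compiler emits. It assumes the Kernarg Segment Ptr
is in s[4:5] and that two dwords were selected for preloading into s[8:9] (all
register numbers and labels are hypothetical):

.. code-block:: none

  example_kernel:                       ; entry used only by firmware without preload support
    s_load_dwordx2 s[8:9], s[4:5], 0x0  ; load the values that newer firmware preloads
    s_waitcnt lgkmcnt(0)
    s_branch .main                      ; skip the rest of the 256-byte compatibility region
    ; ... padding up to 256 bytes from the kernel entry ...
  .main:
    ; firmware with preload support starts execution here (entry offset plus 256 bytes)
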
5220 .. _amdgpu-amdhsa-kernel-prolog:
5225 The compiler performs initialization in the kernel prologue depending on the
5226 target and information about things like stack usage in the kernel and called
5227 functions. Some of this initialization requires the compiler to request certain
5228 User and System SGPRs be present in the
5229 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
5230 :ref:`amdgpu-amdhsa-kernel-descriptor`.
5232 .. _amdgpu-amdhsa-kernel-prolog-cfi:
5237 1. The CFI return address is undefined.
5239 2. The CFI CFA is defined using an expression which evaluates to a location
5240 description that comprises one memory location description for the
5241 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
5243 .. _amdgpu-amdhsa-kernel-prolog-m0:
GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. Total LDS size is
  available in dispatch packet. For M0, it is also possible to use maximum
  possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
  GFX7-GFX8).
GFX9-GFX11
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.
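For example, a GFX7/GFX8 kernel that may access LDS can simply use the maximum
value in its prolog (a minimal sketch):

.. code-block:: none

  s_mov_b32 m0, 0xFFFF   ; maximum LDS size for GFX7-GFX8
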
5258 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
5263 If the kernel has function calls it must set up the ABI stack pointer described
5264 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
SGPR32 to the unswizzled scratch offset of the address past the last local
allocation.
5268 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
5273 If the kernel needs a frame pointer for the reasons defined in
5274 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
5275 kernel prolog. If a frame pointer is not required then all uses of the frame
5276 pointer are replaced with immediate ``0`` offsets.
5278 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
5283 There are different methods used for initializing flat scratch:
5285 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5286 specifies *Does not support generic address space*:
5288 Flat scratch is not supported and there is no flat scratch register pair.
5290 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5291 specifies *Offset flat scratch*:
5293 If the kernel or any function it calls may use flat operations to access
5294 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5295 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
5296 Scratch Wavefront Offset SGPR registers (see
5297 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5299 1. The low word of Flat Scratch Init is the 32-bit byte offset from
5300 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
5301 being managed by SPI for the queue executing the kernel dispatch. This is
5302 the same value used in the Scratch Segment Buffer V# base address.
5304 CP obtains this from the runtime. (The Scratch Segment Buffer base address
5305 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
5307 The prolog must add the value of Scratch Wavefront Offset to get the
5308 wavefront's byte scratch backing memory offset from
5309 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
5311 The Scratch Wavefront Offset must also be used as an offset with Private
5312 segment address when using the Scratch Segment Buffer.
5314 Since FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right
5315 shifted by 8 before moving into FLAT_SCRATCH_HI.
5317 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
5318 SGPRn is the highest numbered SGPR allocated to the wavefront).
5319 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
5320 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
5321 FLAT SCRATCH BASE in flat memory instructions that access the scratch
2. The second word of Flat Scratch Init is the 32-bit byte size of a single
   work-item's scratch memory usage.
5326 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
5327 checks that the value in the kernel dispatch packet Private Segment Byte
5328 Size is not larger and requests the runtime to increase the queue's scratch
5331 CP directly loads from the kernel dispatch packet Private Segment Byte Size
5332 field and rounds up to a multiple of DWORD. Having CP load it once avoids
5333 loading it at the beginning of every wavefront.
5335 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
5336 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
5337 in flat memory instructions.
5339 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5340 specifies *Absolute flat scratch*:
5342 If the kernel or any function it calls may use flat operations to access
5343 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5344 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
5345 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
5346 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5348 The Flat Scratch Init is the 64-bit address of the base of scratch backing
5349 memory being managed by SPI for the queue executing the kernel dispatch.
5351 CP obtains this from the runtime.
5353 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
5354 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
memory instructions (see the sketch after this list).
5358 The Scratch Wavefront Offset must also be used as an offset with Private
5359 segment address when using the Scratch Segment Buffer (see
5360 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
5362 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5363 specifies *Architected flat scratch*:
5365 If ENABLE_PRIVATE_SEGMENT is enabled in
5366 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH
5367 register pair will be initialized to the 64-bit address of the base of scratch
5368 backing memory being managed by SPI for the queue executing the kernel
5369 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
5370 flat scratch base in flat memory instructions.
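As an illustration of the *Absolute flat scratch* case, a GFX9 prolog could
initialize the register pair as follows. This is a minimal sketch that assumes
Flat Scratch Init was preloaded into s[4:5] and the Scratch Wavefront Offset
into s6; the actual SGPR numbers depend on which kernel descriptor enable
fields are set:

.. code-block:: none

  s_add_u32  flat_scratch_lo, s4, s6   ; scratch base (low word) plus Scratch Wavefront Offset
  s_addc_u32 flat_scratch_hi, s5, 0    ; propagate the carry into the base high word
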
5372 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
5374 Private Segment Buffer
5375 ++++++++++++++++++++++
5377 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
5378 *Architected flat scratch* then a Private Segment Buffer is not supported.
5379 Instead the flat SCRATCH instructions are used.
5381 Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
5382 that are used as a V# to access scratch. CP uses the value provided by the
5383 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
5384 access the private memory space using a segment address. See
5385 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
The scratch V# is a four-aligned SGPR and always selected for the kernel as
follows:
5390 - If it is known during instruction selection that there is stack usage,
5391 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
5392 optimizations are disabled (``-O0``), if stack objects already exist (for
5393 locals, etc.), or if there are any function calls.
5395 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5396 are reserved for the tentative scratch V#. These will be used if it is
5397 determined that spilling is needed.
5399 - If no use is made of the tentative scratch V#, then it is unreserved,
5400 and the register count is determined ignoring it.
5401 - If use is made of the tentative scratch V#, then its register numbers
5402 are shifted to the first four-aligned SGPR index after the highest one
5403 allocated by the register allocator, and all uses are updated. The
5404 register count includes them in the shifted location.
- In either case, if the processor has the SGPR allocation bug, the
  tentative allocation is not shifted or unreserved in order to ensure
  the register count is higher to work around the bug.
5411 This approach of using a tentative scratch V# and shifting the register
5412 numbers if used avoids having to perform register allocation a second
5413 time if the tentative V# is eliminated. This is more efficient and
5414 avoids the problem that the second register allocation may perform
5415 spilling which will fail as there is no longer a scratch V#.
5417 When the kernel prolog code is being emitted it is known whether the scratch V#
5418 described above is actually used. If it is, the prolog code must set it up by
5419 copying the Private Segment Buffer to the scratch V# registers and then adding
5420 the Private Segment Wavefront Offset to the queue base address in the V#. The
5421 result is a V# with a base address pointing to the beginning of the wavefront
5422 scratch backing memory.
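A minimal sketch of that setup, assuming the Private Segment Buffer was copied
into s[0:3] to form the scratch V# and the Scratch Wavefront Offset is in s4
(the actual SGPR numbers depend on the enabled kernel descriptor fields):

.. code-block:: none

  s_add_u32  s0, s0, s4   ; add the wavefront offset into the low word of the V# base
  s_addc_u32 s1, s1, 0    ; propagate the carry into the high word of the base
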
5424 The Private Segment Buffer is always requested, but the Private Segment
5425 Wavefront Offset is only requested if it is used (see
5426 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5428 .. _amdgpu-amdhsa-memory-model:
5433 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5434 code (see :ref:`memmodel`).
5436 The AMDGPU backend supports the memory synchronization scopes specified in
5437 :ref:`amdgpu-memory-scopes`.
5439 The code sequences used to implement the memory model specify the order of
5440 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5441 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5442 to other memory instructions executed by the same thread. This allows them to be
5443 moved earlier or later which can allow them to be combined with other instances
5444 of the same instruction, or hoisted/sunk out of loops to improve performance.
5445 Only the instructions related to the memory model are given; additional
5446 ``s_waitcnt`` instructions are required to ensure registers are defined before
5447 being used. These may be able to be combined with the memory model ``s_waitcnt``
5448 instructions as described above.
5450 The AMDGPU backend supports the following memory models:
5452 HSA Memory Model [HSA]_
5453 The HSA memory model uses a single happens-before relation for all address
5454 spaces (see :ref:`amdgpu-address-spaces`).
5455 OpenCL Memory Model [OpenCL]_
5456 The OpenCL memory model which has separate happens-before relations for the
5457 global and local address spaces. Only a fence specifying both global and
5458 local address space, and seq_cst instructions join the relationships. Since
the LLVM ``fence`` instruction does not allow an address space to be
5460 specified the OpenCL fence has to conservatively assume both local and
5461 global address space was specified. However, optimizations can often be
5462 done to eliminate the additional ``s_waitcnt`` instructions when there are
5463 no intervening memory instructions which access the corresponding address
5464 space. The code sequences in the table indicate what can be omitted for the
5465 OpenCL memory. The target triple environment is used to determine if the
5466 source language is OpenCL (see :ref:`amdgpu-opencl`).
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.
5471 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5472 termed vector memory operations.
5474 Private address space uses ``buffer_load/store`` using the scratch V#
5475 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5476 is accessing the memory, atomic memory orderings are not meaningful, and all
5477 accesses are treated as non-atomic.
5479 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5480 scalar memory instructions). Since the constant address space contents do not
5481 change during the execution of a kernel dispatch it is not legal to perform
5482 stores, and atomic memory orderings are not meaningful, and all accesses are
5483 treated as non-atomic.
5485 A memory synchronization scope wider than work-group is not meaningful for the
5486 group (LDS) address space and is treated as work-group.
The memory model does not support the region address space which is treated as
non-coherent private.
5491 Acquire memory ordering is not meaningful on store atomic instructions and is
5492 treated as non-atomic.
5494 Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
5497 Acquire-release memory ordering is not meaningful on load or store atomic
5498 instructions and is treated as acquire and release respectively.
5500 The memory order also adds the single thread optimization constraints defined in
5502 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5504 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5505 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5507 ============ ==============================================================
5508 LLVM Memory Optimization Constraints
5510 ============ ==============================================================
5513 acquire - If a load atomic/atomicrmw then no following load/load
5514 atomic/store/store atomic/atomicrmw/fence instruction can be
5515 moved before the acquire.
5516 - If a fence then same as load atomic, plus no preceding
5517 associated fence-paired-atomic can be moved after the fence.
5518 release - If a store atomic/atomicrmw then no preceding load/load
5519 atomic/store/store atomic/atomicrmw/fence instruction can be
5520 moved after the release.
5521 - If a fence then same as store atomic, plus no following
5522 associated fence-paired-atomic can be moved before the
5524 acq_rel Same constraints as both acquire and release.
5525 seq_cst - If a load atomic then same constraints as acquire, plus no
5526 preceding sequentially consistent load atomic/store
5527 atomic/atomicrmw/fence instruction can be moved after the
5529 - If a store atomic then the same constraints as release, plus
5530 no following sequentially consistent load atomic/store
5531 atomic/atomicrmw/fence instruction can be moved before the
5533 - If an atomicrmw/fence then same constraints as acq_rel.
5534 ============ ==============================================================
The code sequences used to implement the memory model are defined in the
following sections:
5539 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5540 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5541 * :ref:`amdgpu-amdhsa-memory-model-gfx940`
5542 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5544 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5546 Memory Model GFX6-GFX9
5547 ++++++++++++++++++++++
5551 * Each agent has multiple shader arrays (SA).
5552 * Each SA has multiple compute units (CU).
5553 * Each CU has multiple SIMDs that execute wavefronts.
5554 * The wavefronts for a single work-group are executed in the same CU but may be
5555 executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
5561 * The LDS memory has multiple request queues shared by the SIMDs of a
5562 CU. Therefore, the LDS operations performed by different wavefronts of a
5563 work-group can be reordered relative to each other, which can result in
5564 reordering the visibility of vector memory operations with respect to LDS
5565 operations of other wavefronts in the same work-group. A ``s_waitcnt
5566 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5567 vector memory operations between wavefronts of a work-group, but not between
5568 operations performed by the same wavefront.
5569 * The vector memory operations are performed as wavefront wide operations and
5570 completion is reported to a wavefront in execution order. The exception is
5571 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5572 vector memory order if they access LDS memory, and out of LDS operation order
5573 if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
5580 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5581 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5582 scalar operations are used in a restricted way so do not impact the memory
5583 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
5588 * Each CU has a separate request queue per channel. Therefore, the vector and
5589 scalar memory operations performed by wavefronts executing in different
5590 work-groups (which may be executing on different CUs) of an agent can be
5591 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5592 ensure synchronization between vector memory operations of different CUs. It
5593 ensures a previous vector memory operation has completed before executing a
5594 subsequent vector memory or LDS operation and so can be used to meet the
5595 requirements of acquire and release.
5596 * The L2 cache can be kept coherent with other agents on some targets, or ranges
5597 of virtual addresses can be set up to bypass it to ensure system coherence.
5599 Scalar memory operations are only used to access memory that is proven to not
5600 change during the execution of the kernel dispatch. This includes constant
5601 address space and global address space for program scope ``const`` variables.
5602 Therefore, the kernel machine code does not have to maintain the scalar cache to
5603 ensure it is coherent with the vector caches. The scalar and vector caches are
5604 invalidated between kernel dispatches by CP since constant address space data
5605 may change between kernel dispatch executions. See
5606 :ref:`amdgpu-amdhsa-memory-spaces`.
5608 The one exception is if scalar writes are used to spill SGPR registers. In this
5609 case the AMDGPU backend ensures the memory location used to spill is never
5610 accessed by vector memory operations at the same time. If scalar writes are used
5611 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5612 return since the locations may be used for vector memory instructions by a
5613 future wavefront that uses the same scratch area, or a function call that
5614 creates a frame at the same address, respectively. There is no need for a
5615 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
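For example, a GFX9 kernel that used scalar writes for SGPR spilling would end
with a sequence such as the following minimal sketch:

.. code-block:: none

  s_dcache_wb   ; write back dirty scalar cache lines before the locations are reused
  s_endpgm
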
5617 For kernarg backing memory:
5619 * CP invalidates the L1 cache at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is allocated in host memory accessed as
  MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
  causes it to be treated as non-volatile and so is not invalidated by
  ``*_vol``.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
  and so the L2 cache will be coherent with the CPU and other agents.
5627 Scratch backing memory (which is used for the private address space) is accessed
5628 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5629 only accessed by a single thread, and is always write-before-read, there is
5630 never a need to invalidate these entries from the L1 cache. Hence all cache
5631 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5633 The code sequences used to implement the memory model for GFX6-GFX9 are defined
5634 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5636 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5637 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5639 ============ ============ ============== ========== ================================
5640 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
5641 Ordering Sync Scope Address GFX6-GFX9
5643 ============ ============ ============== ========== ================================
5645 ------------------------------------------------------------------------------------
5646 load *none* *none* - global - !volatile & !nontemporal
5648 - private 1. buffer/global/flat_load
5650 - !volatile & nontemporal
5652 1. buffer/global/flat_load
5657 1. buffer/global/flat_load
5659 2. s_waitcnt vmcnt(0)
5661 - Must happen before
5662 any following volatile
5673 load *none* *none* - local 1. ds_load
5674 store *none* *none* - global - !volatile & !nontemporal
5676 - private 1. buffer/global/flat_store
5678 - !volatile & nontemporal
5680 1. buffer/global/flat_store
5685 1. buffer/global/flat_store
5686 2. s_waitcnt vmcnt(0)
5688 - Must happen before
5689 any following volatile
5700 store *none* *none* - local 1. ds_store
5701 **Unordered Atomic**
5702 ------------------------------------------------------------------------------------
5703 load atomic unordered *any* *any* *Same as non-atomic*.
5704 store atomic unordered *any* *any* *Same as non-atomic*.
5705 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
5706 **Monotonic Atomic**
5707 ------------------------------------------------------------------------------------
5708 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
5710 - workgroup - generic
5711 load atomic monotonic - agent - global 1. buffer/global/flat_load
5712 - system - generic glc=1
5713 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
5714 - wavefront - generic
5718 store atomic monotonic - singlethread - local 1. ds_store
5721 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
5722 - wavefront - generic
5726 atomicrmw monotonic - singlethread - local 1. ds_atomic
5730 ------------------------------------------------------------------------------------
5731 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
5734 load atomic acquire - workgroup - global 1. buffer/global_load
5735 load atomic acquire - workgroup - local 1. ds/flat_load
5736 - generic 2. s_waitcnt lgkmcnt(0)
5739 - Must happen before
5748 older than a local load
5752 load atomic acquire - agent - global 1. buffer/global_load
5754 2. s_waitcnt vmcnt(0)
5756 - Must happen before
5764 3. buffer_wbinvl1_vol
5766 - Must happen before
5776 load atomic acquire - agent - generic 1. flat_load glc=1
5777 - system 2. s_waitcnt vmcnt(0) &
5782 - Must happen before
5785 - Ensures the flat_load
5790 3. buffer_wbinvl1_vol
5792 - Must happen before
5802 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
5805 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
5806 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
5807 - generic 2. s_waitcnt lgkmcnt(0)
5810 - Must happen before
5823 atomicrmw acquire - agent - global 1. buffer/global_atomic
5824 - system 2. s_waitcnt vmcnt(0)
5826 - Must happen before
5835 3. buffer_wbinvl1_vol
5837 - Must happen before
5847 atomicrmw acquire - agent - generic 1. flat_atomic
5848 - system 2. s_waitcnt vmcnt(0) &
5853 - Must happen before
5862 3. buffer_wbinvl1_vol
5864 - Must happen before
5874 fence acquire - singlethread *none* *none*
5876 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5881 - However, since LLVM
5906 fence-paired-atomic).
5907 - Must happen before
5918 fence-paired-atomic.
5920 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
5927 - However, since LLVM
5935 - Could be split into
5944 - s_waitcnt vmcnt(0)
5955 fence-paired-atomic).
5956 - s_waitcnt lgkmcnt(0)
5967 fence-paired-atomic).
5968 - Must happen before
5982 fence-paired-atomic.
5984 2. buffer_wbinvl1_vol
5986 - Must happen before any
5987 following global/generic
5997 ------------------------------------------------------------------------------------
5998 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
6001 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6010 - Must happen before
6021 2. buffer/global/flat_store
6022 store atomic release - workgroup - local 1. ds_store
6023 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
6024 - system - generic vmcnt(0)
6030 - Could be split into
6039 - s_waitcnt vmcnt(0)
6046 - s_waitcnt lgkmcnt(0)
6053 - Must happen before
6064 2. buffer/global/flat_store
6065 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
6068 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
6077 - Must happen before
6088 2. buffer/global/flat_atomic
6089 atomicrmw release - workgroup - local 1. ds_atomic
6090 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
6091 - system - generic vmcnt(0)
6095 - Could be split into
6104 - s_waitcnt vmcnt(0)
6111 - s_waitcnt lgkmcnt(0)
6118 - Must happen before
6129 2. buffer/global/flat_atomic
6130 fence release - singlethread *none* *none*
6132 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6137 - However, since LLVM
6158 - Must happen before
6167 fence-paired-atomic).
6174 fence-paired-atomic.
6176 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
6187 - However, since LLVM
6202 - Could be split into
6211 - s_waitcnt vmcnt(0)
6218 - s_waitcnt lgkmcnt(0)
6225 - Must happen before
6234 fence-paired-atomic).
6241 fence-paired-atomic.
6243 **Acquire-Release Atomic**
6244 ------------------------------------------------------------------------------------
6245 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
6248 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
6257 - Must happen before
6268 2. buffer/global_atomic
6270 atomicrmw acq_rel - workgroup - local 1. ds_atomic
6271 2. s_waitcnt lgkmcnt(0)
6274 - Must happen before
6283 older than the local load
6287 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
6296 - Must happen before
6308 3. s_waitcnt lgkmcnt(0)
6311 - Must happen before
6320 older than a local load
6324 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
6329 - Could be split into
6338 - s_waitcnt vmcnt(0)
6345 - s_waitcnt lgkmcnt(0)
6352 - Must happen before
6363 2. buffer/global_atomic
6364 3. s_waitcnt vmcnt(0)
6366 - Must happen before
6375 4. buffer_wbinvl1_vol
6377 - Must happen before
6387 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
6392 - Could be split into
6401 - s_waitcnt vmcnt(0)
6408 - s_waitcnt lgkmcnt(0)
6415 - Must happen before
6427 3. s_waitcnt vmcnt(0) &
6432 - Must happen before
6441 4. buffer_wbinvl1_vol
6443 - Must happen before
6453 fence acq_rel - singlethread *none* *none*
6455 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6475 - Must happen before
6498 acquire-fence-paired-atomic)
6519 release-fence-paired-atomic).
6524 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
6531 - However, since LLVM
6539 - Could be split into
6548 - s_waitcnt vmcnt(0)
6555 - s_waitcnt lgkmcnt(0)
6562 - Must happen before
6567 global/local/generic
6576 acquire-fence-paired-atomic)
6588 global/local/generic
6597 release-fence-paired-atomic).
6602 2. buffer_wbinvl1_vol
6604 - Must happen before
6618 **Sequential Consistent Atomic**
6619 ------------------------------------------------------------------------------------
6620 load atomic seq_cst - singlethread - global *Same as corresponding
6621 - wavefront - local load atomic acquire,
6622 - generic except must generate
6623 all instructions even
6625 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
6641 lgkmcnt(0) and so do
6673 order. The s_waitcnt
6674 could be placed after
6678 make the s_waitcnt be
6685 instructions same as
6688 except must generate
6689 all instructions even
6691 load atomic seq_cst - workgroup - local *Same as corresponding
6692 load atomic acquire,
6693 except must generate
6694 all instructions even
6697 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
6698 - system - generic vmcnt(0)
6700 - Could be split into
6709 - s_waitcnt lgkmcnt(0)
6722 lgkmcnt(0) and so do
6725 - s_waitcnt vmcnt(0)
6770 order. The s_waitcnt
6771 could be placed after
6775 make the s_waitcnt be
6782 instructions same as
6785 except must generate
6786 all instructions even
6788 store atomic seq_cst - singlethread - global *Same as corresponding
6789 - wavefront - local store atomic release,
6790 - workgroup - generic except must generate
6791 - agent all instructions even
6792 - system for OpenCL.*
6793 atomicrmw seq_cst - singlethread - global *Same as corresponding
6794 - wavefront - local atomicrmw acq_rel,
6795 - workgroup - generic except must generate
6796 - agent all instructions even
6797 - system for OpenCL.*
6798 fence seq_cst - singlethread *none* *Same as corresponding
6799 - wavefront fence acq_rel,
6800 - workgroup except must generate
6801 - agent all instructions even
6802 - system for OpenCL.*
6803 ============ ============ ============== ========== ================================
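As an illustration of the table above, an agent scope acquire ``load atomic``
from a generic address on GFX8 could be lowered to the following minimal
sketch (register numbers are hypothetical):

.. code-block:: none

  flat_load_dword v2, v[0:1] glc   ; glc=1 so the load returns up-to-date data from the L2
  s_waitcnt vmcnt(0) lgkmcnt(0)    ; must complete before the following invalidate
  buffer_wbinvl1_vol               ; invalidate volatile L1 lines before any following access
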
6805 .. _amdgpu-amdhsa-memory-model-gfx90a:
6812 * Each agent has multiple shader arrays (SA).
6813 * Each SA has multiple compute units (CU).
6814 * Each CU has multiple SIMDs that execute wavefronts.
6815 * The wavefronts for a single work-group are executed in the same CU but may be
6816 executed by different SIMDs. The exception is when in tgsplit execution mode
6817 when the wavefronts may be executed by different SIMDs in different CUs.
6818 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6819 executing on it. The exception is when in tgsplit execution mode when no LDS
6820 is allocated as wavefronts of the same work-group can be in different CUs.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
6824 * The LDS memory has multiple request queues shared by the SIMDs of a
6825 CU. Therefore, the LDS operations performed by different wavefronts of a
6826 work-group can be reordered relative to each other, which can result in
6827 reordering the visibility of vector memory operations with respect to LDS
6828 operations of other wavefronts in the same work-group. A ``s_waitcnt
6829 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6830 vector memory operations between wavefronts of a work-group, but not between
6831 operations performed by the same wavefront.
6832 * The vector memory operations are performed as wavefront wide operations and
6833 completion is reported to a wavefront in execution order. The exception is
6834 that ``flat_load/store/atomic`` instructions can report out of vector memory
6835 order if they access LDS memory, and out of LDS operation order if they access
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore:

  * No special action is required for coherence between the lanes of a single
    wavefront.
6843 * No special action is required for coherence between wavefronts in the same
6844 work-group since they execute on the same CU. The exception is when in
6845 tgsplit execution mode as wavefronts of the same work-group can be in
6846 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6849 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
    executing in different work-groups as they may be executing on different
    CUs.
6853 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6854 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6855 scalar operations are used in a restricted way so do not impact the memory
6856 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
6862 * Each CU has a separate request queue per channel. Therefore, the vector and
6863 scalar memory operations performed by wavefronts executing in different
6864 work-groups (which may be executing on different CUs), or the same
6865 work-group if executing in tgsplit mode, of an agent can be reordered
6866 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6867 synchronization between vector memory operations of different CUs. It
6868 ensures a previous vector memory operation has completed before executing a
6869 subsequent vector memory or LDS operation and so can be used to meet the
6870 requirements of acquire and release.
6871 * The L2 cache of one agent can be kept coherent with other agents by:
6872 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6873 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6874 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6876 * Any local memory cache lines will be automatically invalidated by writes
6877 from CUs associated with other L2 caches, or writes from the CPU, due to
6878 the cache probe caused by coherent requests. Coherent requests are caused
6879 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6880 XGMI, and by PCIe requests that are configured to be coherent requests.
6881 * XGMI accesses from the CPU to local memory may be cached on the CPU.
6882 Subsequent access from the GPU will automatically invalidate or writeback
    the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6884 * Since all work-groups on the same agent share the same L2, no L2
6885 invalidation or writeback is required for coherence.
6886 * To ensure coherence of local and remote memory writes of work-groups in
6887 different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6888 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
    (used for remote coarse grain memory). Note that MTYPE CC (used for local
6890 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6891 remote fine grain memory) bypasses the L2, so both will never result in
6892 dirty L2 cache lines.
6893 * To ensure coherence of local and remote memory reads of work-groups in
6894 different agents a ``buffer_invl2`` is required. It will invalidate L2
6895 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6896 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
    coarse grain memory) cause local reads to be invalidated by remote writes
    with the PTE C-bit so these cache lines are not invalidated. Note that
6899 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6900 never result in L2 cache lines that need to be invalidated.
6902 * PCIe access from the GPU to the CPU memory is kept coherent by using the
6903 MTYPE UC (uncached) which bypasses the L2.
6905 Scalar memory operations are only used to access memory that is proven to not
6906 change during the execution of the kernel dispatch. This includes constant
6907 address space and global address space for program scope ``const`` variables.
6908 Therefore, the kernel machine code does not have to maintain the scalar cache to
6909 ensure it is coherent with the vector caches. The scalar and vector caches are
6910 invalidated between kernel dispatches by CP since constant address space data
6911 may change between kernel dispatch executions. See
6912 :ref:`amdgpu-amdhsa-memory-spaces`.
6914 The one exception is if scalar writes are used to spill SGPR registers. In this
6915 case the AMDGPU backend ensures the memory location used to spill is never
6916 accessed by vector memory operations at the same time. If scalar writes are used
6917 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6918 return since the locations may be used for vector memory instructions by a
6919 future wavefront that uses the same scratch area, or a function call that
6920 creates a frame at the same address, respectively. There is no need for a
6921 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6923 For kernarg backing memory:
6925 * CP invalidates the L1 cache at the start of each kernel dispatch.
6926 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6927 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6928 cache. This also causes it to be treated as non-volatile and so is not
6929 invalidated by ``*_vol``.
6930 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6931 so the L2 cache will be coherent with the CPU and other agents.
6933 Scratch backing memory (which is used for the private address space) is accessed
6934 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6935 only accessed by a single thread, and is always write-before-read, there is
6936 never a need to invalidate these entries from the L1 cache. Hence all cache
6937 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6939 The code sequences used to implement the memory model for GFX90A are defined
6940 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6942 .. table:: AMDHSA Memory Model Code Sequences GFX90A
6943 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6945 ============ ============ ============== ========== ================================
6946 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
6947 Ordering Sync Scope Address GFX90A
6949 ============ ============ ============== ========== ================================
6951 ------------------------------------------------------------------------------------
6952 load *none* *none* - global - !volatile & !nontemporal
6954 - private 1. buffer/global/flat_load
6956 - !volatile & nontemporal
6958 1. buffer/global/flat_load
6963 1. buffer/global/flat_load
6965 2. s_waitcnt vmcnt(0)
6967 - Must happen before
6968 any following volatile
6979 load *none* *none* - local 1. ds_load
6980 store *none* *none* - global - !volatile & !nontemporal
6982 - private 1. buffer/global/flat_store
6984 - !volatile & nontemporal
6986 1. buffer/global/flat_store
6991 1. buffer/global/flat_store
6992 2. s_waitcnt vmcnt(0)
6994 - Must happen before
6995 any following volatile
7006 store *none* *none* - local 1. ds_store
7007 **Unordered Atomic**
7008 ------------------------------------------------------------------------------------
7009 load atomic unordered *any* *any* *Same as non-atomic*.
7010 store atomic unordered *any* *any* *Same as non-atomic*.
7011 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
7012 **Monotonic Atomic**
7013 ------------------------------------------------------------------------------------
7014 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
7015 - wavefront - generic
7016 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
7019 - If not TgSplit execution
7022 load atomic monotonic - singlethread - local *If TgSplit execution mode,
7023 - wavefront local address space cannot
7024 - workgroup be used.*
7027 load atomic monotonic - agent - global 1. buffer/global/flat_load
7029 load atomic monotonic - system - global 1. buffer/global/flat_load
7031 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
7032 - wavefront - generic
7035 store atomic monotonic - system - global 1. buffer/global/flat_store
7037 store atomic monotonic - singlethread - local *If TgSplit execution mode,
7038 - wavefront local address space cannot
7039 - workgroup be used.*
7042 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
7043 - wavefront - generic
7046 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
7048 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
7049 - wavefront local address space cannot
7050 - workgroup be used.*
7054 ------------------------------------------------------------------------------------
7055 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
7058 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
7060 - If not TgSplit execution
7063 2. s_waitcnt vmcnt(0)
7065 - If not TgSplit execution
7067 - Must happen before the
7068 following buffer_wbinvl1_vol.
7070 3. buffer_wbinvl1_vol
7072 - If not TgSplit execution
7074 - Must happen before
7085 load atomic acquire - workgroup - local *If TgSplit execution mode,
7086 local address space cannot
7090 2. s_waitcnt lgkmcnt(0)
7093 - Must happen before
7102 older than the local load
7106 load atomic acquire - workgroup - generic 1. flat_load glc=1
7108 - If not TgSplit execution
7111 2. s_waitcnt lgkm/vmcnt(0)
7113 - Use lgkmcnt(0) if not
7114 TgSplit execution mode
7115 and vmcnt(0) if TgSplit
7117 - If OpenCL, omit lgkmcnt(0).
7118 - Must happen before
7120 buffer_wbinvl1_vol and any
7121 following global/generic
7128 older than a local load
7132 3. buffer_wbinvl1_vol
7134 - If not TgSplit execution
7141 load atomic acquire - agent - global 1. buffer/global_load
7143 2. s_waitcnt vmcnt(0)
7145 - Must happen before
7153 3. buffer_wbinvl1_vol
7155 - Must happen before
7165 load atomic acquire - system - global 1. buffer/global/flat_load
7167 2. s_waitcnt vmcnt(0)
7169 - Must happen before
7170 following buffer_invl2 and
7180 - Must happen before
7188 stale L1 global data,
7189 nor see stale L2 MTYPE
7191 MTYPE RW and CC memory will
7192 never be stale in L2 due to
7195 load atomic acquire - agent - generic 1. flat_load glc=1
7196 2. s_waitcnt vmcnt(0) &
7199 - If TgSplit execution mode,
7203 - Must happen before
7206 - Ensures the flat_load
7211 3. buffer_wbinvl1_vol
7213 - Must happen before
7223 load atomic acquire - system - generic 1. flat_load glc=1
7224 2. s_waitcnt vmcnt(0) &
7227 - If TgSplit execution mode,
7231 - Must happen before
7235 - Ensures the flat_load
7243 - Must happen before
7251 stale L1 global data,
7252 nor see stale L2 MTYPE
7254 MTYPE RW and CC memory will
7255 never be stale in L2 due to
7258 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
7259 - wavefront - generic
7260 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
7261 - wavefront local address space cannot
7265 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
7266 2. s_waitcnt vmcnt(0)
7268 - If not TgSplit execution
7270 - Must happen before the
7271 following buffer_wbinvl1_vol.
7272 - Ensures the atomicrmw
7277 3. buffer_wbinvl1_vol
7279 - If not TgSplit execution
7281 - Must happen before
7291 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
7292 local address space cannot
7296 2. s_waitcnt lgkmcnt(0)
7299 - Must happen before
7308 older than the local
7312 atomicrmw acquire - workgroup - generic 1. flat_atomic
7313 2. s_waitcnt lgkm/vmcnt(0)
7315 - Use lgkmcnt(0) if not
7316 TgSplit execution mode
7317 and vmcnt(0) if TgSplit
7319 - If OpenCL, omit lgkmcnt(0).
7320 - Must happen before
7322 buffer_wbinvl1_vol and
7335 3. buffer_wbinvl1_vol
7337 - If not TgSplit execution
7344 atomicrmw acquire - agent - global 1. buffer/global_atomic
7345 2. s_waitcnt vmcnt(0)
7347 - Must happen before
7356 3. buffer_wbinvl1_vol
7358 - Must happen before
7368 atomicrmw acquire - system - global 1. buffer/global_atomic
7369 2. s_waitcnt vmcnt(0)
7371 - Must happen before
7372 following buffer_invl2 and
7383 - Must happen before
7391 stale L1 global data,
7392 nor see stale L2 MTYPE
7394 MTYPE RW and CC memory will
7395 never be stale in L2 due to
7398 atomicrmw acquire - agent - generic 1. flat_atomic
7399 2. s_waitcnt vmcnt(0) &
7402 - If TgSplit execution mode,
7406 - Must happen before
7415 3. buffer_wbinvl1_vol
7417 - Must happen before
7427 atomicrmw acquire - system - generic 1. flat_atomic
7428 2. s_waitcnt vmcnt(0) &
7431 - If TgSplit execution mode,
7435 - Must happen before
7448 - Must happen before
7456 stale L1 global data,
7457 nor see stale L2 MTYPE
7459 MTYPE RW and CC memory will
7460 never be stale in L2 due to
7463 fence acquire - singlethread *none* *none*
7465 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7467 - Use lgkmcnt(0) if not
7468 TgSplit execution mode
7469 and vmcnt(0) if TgSplit
7479 - However, since LLVM
7494 - s_waitcnt vmcnt(0)
7506 fence-paired-atomic).
7507 - s_waitcnt lgkmcnt(0)
7518 fence-paired-atomic).
7519 - Must happen before
7521 buffer_wbinvl1_vol and
7532 fence-paired-atomic.
7534 2. buffer_wbinvl1_vol
7536 - If not TgSplit execution
7543 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
7546 - If TgSplit execution mode,
7552 - However, since LLVM
7560 - Could be split into
7569 - s_waitcnt vmcnt(0)
7580 fence-paired-atomic).
7581 - s_waitcnt lgkmcnt(0)
7592 fence-paired-atomic).
7593 - Must happen before
7607 fence-paired-atomic.
7609 2. buffer_wbinvl1_vol
7611 - Must happen before any
7612 following global/generic
7621 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
7624 - If TgSplit execution mode,
7630 - However, since LLVM
7638 - Could be split into
7647 - s_waitcnt vmcnt(0)
7658 fence-paired-atomic).
7659 - s_waitcnt lgkmcnt(0)
7670 fence-paired-atomic).
7671 - Must happen before
7672 the following buffer_invl2 and
7685 fence-paired-atomic.
7690 - Must happen before any
7691 following global/generic
7698 stale L1 global data,
7699 nor see stale L2 MTYPE
7701 MTYPE RW and CC memory will
7702 never be stale in L2 due to
7705 ------------------------------------------------------------------------------------
7706 store atomic release - singlethread - global 1. buffer/global/flat_store
7707 - wavefront - generic
7708 store atomic release - singlethread - local *If TgSplit execution mode,
7709 - wavefront local address space cannot
7713 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7715 - Use lgkmcnt(0) if not
7716 TgSplit execution mode
7717 and vmcnt(0) if TgSplit
7719 - If OpenCL, omit lgkmcnt(0).
7720 - s_waitcnt vmcnt(0)
7723 global/generic load/store/
7724 load atomic/store atomic/
7726 - s_waitcnt lgkmcnt(0)
7733 - Must happen before
7744 2. buffer/global/flat_store
7745 store atomic release - workgroup - local *If TgSplit execution mode,
7746 local address space cannot
7750 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
7753 - If TgSplit execution mode,
7759 - Could be split into
7768 - s_waitcnt vmcnt(0)
7775 - s_waitcnt lgkmcnt(0)
7782 - Must happen before
7793 2. buffer/global/flat_store
7794 store atomic release - system - global 1. buffer_wbl2
7796 - Must happen before
7797 following s_waitcnt.
7798 - Performs L2 writeback to
7802 visible at system scope.
7804 2. s_waitcnt lgkmcnt(0) &
7807 - If TgSplit execution mode,
7813 - Could be split into
7822 - s_waitcnt vmcnt(0)
7823 must happen after any
7829 - s_waitcnt lgkmcnt(0)
7830 must happen after any
7836 - Must happen before
7841 to memory and the L2
7848 3. buffer/global/flat_store
7849 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
7850 - wavefront - generic
7851 atomicrmw release - singlethread - local *If TgSplit execution mode,
7852 - wavefront local address space cannot
7856 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7858 - Use lgkmcnt(0) if not
7859 TgSplit execution mode
7860 and vmcnt(0) if TgSplit
7864 - s_waitcnt vmcnt(0)
7867 global/generic load/store/
7868 load atomic/store atomic/
7870 - s_waitcnt lgkmcnt(0)
7877 - Must happen before
7888 2. buffer/global/flat_atomic
7889 atomicrmw release - workgroup - local *If TgSplit execution mode,
7890 local address space cannot
7894 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
7897 - If TgSplit execution mode,
7901 - Could be split into
7910 - s_waitcnt vmcnt(0)
7917 - s_waitcnt lgkmcnt(0)
7924 - Must happen before
7935 2. buffer/global/flat_atomic
7936 atomicrmw release - system - global 1. buffer_wbl2
7938 - Must happen before
7939 following s_waitcnt.
7940 - Performs L2 writeback to
7944 visible at system scope.
7946 2. s_waitcnt lgkmcnt(0) &
7949 - If TgSplit execution mode,
7953 - Could be split into
7962 - s_waitcnt vmcnt(0)
7969 - s_waitcnt lgkmcnt(0)
7976 - Must happen before
7981 to memory and the L2
7988 3. buffer/global/flat_atomic
7989 fence release - singlethread *none* *none*
7991 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7993 - Use lgkmcnt(0) if not
7994 TgSplit execution mode
7995 and vmcnt(0) if TgSplit
8005 - However, since LLVM
8020 - s_waitcnt vmcnt(0)
8025 load atomic/store atomic/
8027 - s_waitcnt lgkmcnt(0)
8034 - Must happen before
8043 fence-paired-atomic).
8050 fence-paired-atomic.
8052 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
8055 - If TgSplit execution mode,
8065 - However, since LLVM
8080 - Could be split into
8089 - s_waitcnt vmcnt(0)
8096 - s_waitcnt lgkmcnt(0)
8103 - Must happen before
8112 fence-paired-atomic).
8119 fence-paired-atomic.
8121 fence release - system *none* 1. buffer_wbl2
8126 - Must happen before
8127 following s_waitcnt.
8128 - Performs L2 writeback to
8132 visible at system scope.
8134 2. s_waitcnt lgkmcnt(0) &
8137 - If TgSplit execution mode,
8147 - However, since LLVM
8162 - Could be split into
8171 - s_waitcnt vmcnt(0)
8178 - s_waitcnt lgkmcnt(0)
8185 - Must happen before
8194 fence-paired-atomic).
8201 fence-paired-atomic.
8203 **Acquire-Release Atomic**
8204 ------------------------------------------------------------------------------------
8205 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
8206 - wavefront - generic
8207 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
8208 - wavefront local address space cannot
8212 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8214 - Use lgkmcnt(0) if not
8215 TgSplit execution mode
8216 and vmcnt(0) if TgSplit
8226 - s_waitcnt vmcnt(0)
8229 global/generic load/store/
8230 load atomic/store atomic/
8232 - s_waitcnt lgkmcnt(0)
8239 - Must happen before
8250 2. buffer/global_atomic
8251 3. s_waitcnt vmcnt(0)
8253 - If not TgSplit execution
8255 - Must happen before
8265 4. buffer_wbinvl1_vol
8267 - If not TgSplit execution
8274 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
8275 local address space cannot
8279 2. s_waitcnt lgkmcnt(0)
8282 - Must happen before
8291 older than the local load
8295 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
8297 - Use lgkmcnt(0) if not
8298 TgSplit execution mode
8299 and vmcnt(0) if TgSplit
8303 - s_waitcnt vmcnt(0)
8306 global/generic load/store/
8307 load atomic/store atomic/
8309 - s_waitcnt lgkmcnt(0)
8316 - Must happen before
8328 3. s_waitcnt lgkmcnt(0) &
8331 - If not TgSplit execution
8332 mode, omit vmcnt(0).
8335 - Must happen before
8337 buffer_wbinvl1_vol and
8346 older than a local load
8350 3. buffer_wbinvl1_vol
8352 - If not TgSplit execution
8359 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
8362 - If TgSplit execution mode,
8366 - Could be split into
8375 - s_waitcnt vmcnt(0)
8382 - s_waitcnt lgkmcnt(0)
8389 - Must happen before
8400 2. buffer/global_atomic
8401 3. s_waitcnt vmcnt(0)
8403 - Must happen before
8412 4. buffer_wbinvl1_vol
8414 - Must happen before
8424 atomicrmw acq_rel - system - global 1. buffer_wbl2
8426 - Must happen before
8427 following s_waitcnt.
8428 - Performs L2 writeback to
8432 visible at system scope.
8434 2. s_waitcnt lgkmcnt(0) &
8437 - If TgSplit execution mode,
8441 - Could be split into
8450 - s_waitcnt vmcnt(0)
8457 - s_waitcnt lgkmcnt(0)
8464 - Must happen before
8469 to global and L2 writeback
8470 have completed before
8475 3. buffer/global_atomic
8476 4. s_waitcnt vmcnt(0)
8478 - Must happen before
8479 following buffer_invl2 and
8490 - Must happen before
8498 stale L1 global data,
8499 nor see stale L2 MTYPE
8501 MTYPE RW and CC memory will
8502 never be stale in L2 due to
8505 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
8508 - If TgSplit execution mode,
8512 - Could be split into
8521 - s_waitcnt vmcnt(0)
8528 - s_waitcnt lgkmcnt(0)
8535 - Must happen before
8547 3. s_waitcnt vmcnt(0) &
8550 - If TgSplit execution mode,
8554 - Must happen before
8563 4. buffer_wbinvl1_vol
8565 - Must happen before
8575 atomicrmw acq_rel - system - generic 1. buffer_wbl2
8577 - Must happen before
8578 following s_waitcnt.
8579 - Performs L2 writeback to
8583 visible at system scope.
8585 2. s_waitcnt lgkmcnt(0) &
8588 - If TgSplit execution mode,
8592 - Could be split into
8601 - s_waitcnt vmcnt(0)
8608 - s_waitcnt lgkmcnt(0)
8615 - Must happen before
8620 to global and L2 writeback
8621 have completed before
8627 4. s_waitcnt vmcnt(0) &
8630 - If TgSplit execution mode,
8634 - Must happen before
8635 following buffer_invl2 and
8646 - Must happen before
8654 stale L1 global data,
8655 nor see stale L2 MTYPE
8657 MTYPE RW and CC memory will
8658 never be stale in L2 due to
8661 fence acq_rel - singlethread *none* *none*
8663 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8665 - Use lgkmcnt(0) if not
8666 TgSplit execution mode
8667 and vmcnt(0) if TgSplit
8686 - s_waitcnt vmcnt(0)
8691 load atomic/store atomic/
8693 - s_waitcnt lgkmcnt(0)
8700 - Must happen before
8723 acquire-fence-paired-atomic)
8744 release-fence-paired-atomic).
8748 - Must happen before
8752 acquire-fence-paired
8753 atomic has completed
8762 acquire-fence-paired-atomic.
8764 2. buffer_wbinvl1_vol
8766 - If not TgSplit execution
8773 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
8776 - If TgSplit execution mode,
8782 - However, since LLVM
8790 - Could be split into
8799 - s_waitcnt vmcnt(0)
8806 - s_waitcnt lgkmcnt(0)
8813 - Must happen before
8818 global/local/generic
8827 acquire-fence-paired-atomic)
8839 global/local/generic
8848 release-fence-paired-atomic).
8853 2. buffer_wbinvl1_vol
8855 - Must happen before
8869 fence acq_rel - system *none* 1. buffer_wbl2
8874 - Must happen before
8875 following s_waitcnt.
8876 - Performs L2 writeback to
8880 visible at system scope.
8882 2. s_waitcnt lgkmcnt(0) &
8885 - If TgSplit execution mode,
8891 - However, since LLVM
8899 - Could be split into
8908 - s_waitcnt vmcnt(0)
8915 - s_waitcnt lgkmcnt(0)
8922 - Must happen before
8923 the following buffer_invl2 and
8927 global/local/generic
8936 acquire-fence-paired-atomic)
8948 global/local/generic
8957 release-fence-paired-atomic).
8965 - Must happen before
8974 stale L1 global data,
8975 nor see stale L2 MTYPE
8977 MTYPE RW and CC memory will
8978 never be stale in L2 due to
8981 **Sequential Consistent Atomic**
8982 ------------------------------------------------------------------------------------
8983 load atomic seq_cst - singlethread - global *Same as corresponding
8984 - wavefront - local load atomic acquire,
8985 - generic except must generate
8986 all instructions even
8988 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8990 - Use lgkmcnt(0) if not
8991 TgSplit execution mode
8992 and vmcnt(0) if TgSplit
8994 - s_waitcnt lgkmcnt(0) must
9007 lgkmcnt(0) and so do
9010 - s_waitcnt vmcnt(0)
9029 consistent global/local
9055 order. The s_waitcnt
9056 could be placed after
9060 make the s_waitcnt be
9067 instructions same as
9070 except must generate
9071 all instructions even
9073 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
9074 local address space cannot
9077 *Same as corresponding
9078 load atomic acquire,
9079 except must generate
9080 all instructions even
9083 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
9084 - system - generic vmcnt(0)
9086 - If TgSplit execution mode,
9088 - Could be split into
9097 - s_waitcnt lgkmcnt(0)
9110 lgkmcnt(0) and so do
9113 - s_waitcnt vmcnt(0)
9158 order. The s_waitcnt
9159 could be placed after
9163 make the s_waitcnt be
9170 instructions same as
9173 except must generate
9174 all instructions even
9176 store atomic seq_cst - singlethread - global *Same as corresponding
9177 - wavefront - local store atomic release,
9178 - workgroup - generic except must generate
9179 - agent all instructions even
9180 - system for OpenCL.*
9181 atomicrmw seq_cst - singlethread - global *Same as corresponding
9182 - wavefront - local atomicrmw acq_rel,
9183 - workgroup - generic except must generate
9184 - agent all instructions even
9185 - system for OpenCL.*
9186 fence seq_cst - singlethread *none* *Same as corresponding
9187 - wavefront fence acq_rel,
9188 - workgroup except must generate
9189 - agent all instructions even
9190 - system for OpenCL.*
9191 ============ ============ ============== ========== ================================
9193 .. _amdgpu-amdhsa-memory-model-gfx940:

Memory Model GFX940
+++++++++++++++++++

For GFX940:

9200 * Each agent has multiple shader arrays (SA).
9201 * Each SA has multiple compute units (CU).
9202 * Each CU has multiple SIMDs that execute wavefronts.
9203 * The wavefronts for a single work-group are executed in the same CU but may be
9204 executed by different SIMDs. The exception is when in tgsplit execution mode
9205 when the wavefronts may be executed by different SIMDs in different CUs.
9206 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
9207 executing on it. The exception is when in tgsplit execution mode when no LDS
9208 is allocated as wavefronts of the same work-group can be in different CUs.
9209 * All LDS operations of a CU are performed as wavefront wide operations in a
9210 global order and involve no caching. Completion is reported to a wavefront in execution order.
9212 * The LDS memory has multiple request queues shared by the SIMDs of a
9213 CU. Therefore, the LDS operations performed by different wavefronts of a
9214 work-group can be reordered relative to each other, which can result in
9215 reordering the visibility of vector memory operations with respect to LDS
9216 operations of other wavefronts in the same work-group. A ``s_waitcnt
9217 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
9218 vector memory operations between wavefronts of a work-group, but not between
9219 operations performed by the same wavefront.
9220 * The vector memory operations are performed as wavefront wide operations and
9221 completion is reported to a wavefront in execution order. The exception is
9222 that ``flat_load/store/atomic`` instructions can report out of vector memory
9223 order if they access LDS memory, and out of LDS operation order if they access global memory.
9225 * The vector memory operations access a single vector L1 cache shared by all
9226 SIMDs of a CU. Therefore:
9228 * No special action is required for coherence between the lanes of a single wavefront.
9231 * No special action is required for coherence between wavefronts in the same
9232 work-group since they execute on the same CU. The exception is when in
9233 tgsplit execution mode as wavefronts of the same work-group can be in
9234 different CUs and so a ``buffer_inv sc0`` is required which will invalidate the L1 cache.
9237 * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
9238 between wavefronts executing in different work-groups as they may be
9239 executing on different CUs.
9241 * Atomic read-modify-write instructions implicitly bypass the L1 cache.
9242 Therefore, they do not use the sc0 bit for coherence and instead use it to
9243 indicate if the instruction returns the original value being updated. They
9244 do use sc1 to indicate system or agent scope coherence.
9246 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
9247 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
9248 scalar operations are used in a restricted way so do not impact the memory
9249 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
9250 * The vector and scalar memory operations use an L2 cache.
9252 * The gfx940 can be configured as a number of smaller agents with each having
9253 a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
9254 larger agents with groups of CUs on each agent each sharing separate L2 caches.
9256 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
9258 * Each CU has a separate request queue per channel for its associated L2.
9259 Therefore, the vector and scalar memory operations performed by wavefronts
9260 executing with different L1 caches and the same L2 cache can be reordered
9261 relative to each other.
9262 * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
9263 vector memory operations of different CUs. It ensures a previous vector
9264 memory operation has completed before executing a subsequent vector memory
9265 or LDS operation and so can be used to meet the requirements of acquire and release.
9267 * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
9268 (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
9269 the PTE C-bit set for memory not local to the L2.
9271 * Any local memory cache lines will be automatically invalidated by writes
9272 from CUs associated with other L2 caches, or writes from the CPU, due to
9273 the cache probe caused by the PTE C-bit.
9274 * XGMI accesses from the CPU to local memory may be cached on the CPU.
9275 Subsequent access from the GPU will automatically invalidate or writeback
9276 the CPU cache due to the L2 probe filter.
9277 * To ensure coherence of local memory writes of CUs with different L1 caches
9278 in the same agent a ``buffer_wbl2`` is required. It does nothing if the
9279 agent is configured to have a single L2, or will writeback dirty L2 cache
9280 lines if configured to have multiple L2 caches.
9281 * To ensure coherence of local memory writes of CUs in different agents a
9282 ``buffer_wbl2 sc1`` is required. It will writeback dirty L2 cache lines.
9283 * To ensure coherence of local memory reads of CUs with different L1 caches
9284 in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
9285 agent is configured to have a single L2, or will invalidate non-local L2
9286 cache lines if configured to have multiple L2 caches.
9287 * To ensure coherence of local memory reads of CUs in different agents a
9288 ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
9289 lines if configured to have multiple L2 caches.
9291 * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
9292 UC (uncached) which bypasses the L2.
9294 Scalar memory operations are only used to access memory that is proven to not
9295 change during the execution of the kernel dispatch. This includes constant
9296 address space and global address space for program scope ``const`` variables.
9297 Therefore, the kernel machine code does not have to maintain the scalar cache to
9298 ensure it is coherent with the vector caches. The scalar and vector caches are
9299 invalidated between kernel dispatches by CP since constant address space data
9300 may change between kernel dispatch executions. See
9301 :ref:`amdgpu-amdhsa-memory-spaces`.
9303 The one exception is if scalar writes are used to spill SGPR registers. In this
9304 case the AMDGPU backend ensures the memory location used to spill is never
9305 accessed by vector memory operations at the same time. If scalar writes are used
9306 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
9307 return since the locations may be used for vector memory instructions by a
9308 future wavefront that uses the same scratch area, or a function call that
9309 creates a frame at the same address, respectively. There is no need for a
9310 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
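
Setting the spill case aside, the following minimal sketch shows the kind of
access the scalar path is used for (the kernel and argument names are
hypothetical, and whether a given load is selected as a scalar instruction is
ultimately the backend's decision):

.. code-block:: llvm

   ; Hypothetical example: a uniform load of dispatch-invariant data through
   ; the constant address space (addrspace(4)) may be selected as a scalar
   ; (s_load_*) instruction and served by the scalar cache, while the store to
   ; the global address space (addrspace(1)) uses vector memory instructions.
   define amdgpu_kernel void @read_invariant(ptr addrspace(4) %table,
                                             ptr addrspace(1) %out) {
   entry:
     %v = load i32, ptr addrspace(4) %table, align 4
     store i32 %v, ptr addrspace(1) %out, align 4
     ret void
   }
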
9312 For kernarg backing memory:
9314 * CP invalidates the L1 cache at the start of each kernel dispatch.
9315 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
9316 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
9317 cache. This also causes it to be treated as non-volatile and so is not
9318 invalidated by ``*_vol``.
9319 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
9320 so the L2 cache will be coherent with the CPU and other agents.
9322 Scratch backing memory (which is used for the private address space) is accessed
9323 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
9324 only accessed by a single thread, and is always write-before-read, there is
9325 never a need to invalidate these entries from the L1 cache. Hence all cache
9326 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
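
A minimal, hypothetical sketch of the write-before-read pattern this relies on
(the kernel name and values are illustrative only):

.. code-block:: llvm

   ; Hypothetical example: a per-lane temporary in the private address space
   ; (addrspace(5)). Each lane writes its own scratch slot before reading it
   ; back, so these accesses never require a cache invalidation.
   define amdgpu_kernel void @private_temp(ptr addrspace(1) %out, i32 %x) {
   entry:
     %tmp = alloca i32, align 4, addrspace(5)
     store i32 %x, ptr addrspace(5) %tmp, align 4
     %v = load i32, ptr addrspace(5) %tmp, align 4
     store i32 %v, ptr addrspace(1) %out, align 4
     ret void
   }
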
9328 The code sequences used to implement the memory model for GFX940 are defined
9329 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
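
As an aid to reading the table, the following hypothetical IR publishes data
with an agent-scope release store and consumes it with an agent-scope acquire
load; the comments paraphrase the corresponding table rows and are not
verbatim compiler output:

.. code-block:: llvm

   ; Hypothetical producer: per the "store atomic release - agent - global"
   ; row, the release store lowers to roughly buffer_wbl2 sc1=1, then
   ; s_waitcnt lgkmcnt(0) & vmcnt(0), then the store with sc1=1.
   define amdgpu_kernel void @publish(ptr addrspace(1) %data,
                                      ptr addrspace(1) %flag) {
   entry:
     store i32 42, ptr addrspace(1) %data, align 4
     store atomic i32 1, ptr addrspace(1) %flag syncscope("agent") release, align 4
     ret void
   }

   ; Hypothetical consumer: per the "load atomic acquire - agent - global"
   ; row, the acquire load lowers to roughly a global load, then
   ; s_waitcnt vmcnt(0), then a buffer_inv to discard stale L1 data.
   define i32 @consume(ptr addrspace(1) %data, ptr addrspace(1) %flag) {
   entry:
     %f = load atomic i32, ptr addrspace(1) %flag syncscope("agent") acquire, align 4
     %d = load i32, ptr addrspace(1) %data, align 4
     %sum = add i32 %f, %d
     ret i32 %sum
   }
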
9331 .. table:: AMDHSA Memory Model Code Sequences GFX940
9332 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
9334 ============ ============ ============== ========== ================================
9335 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
9336 Ordering Sync Scope Address GFX940
9338 ============ ============ ============== ========== ================================
9340 ------------------------------------------------------------------------------------
9341 load *none* *none* - global - !volatile & !nontemporal
9343 - private 1. buffer/global/flat_load
9345 - !volatile & nontemporal
9347 1. buffer/global/flat_load
9352 1. buffer/global/flat_load
9354 2. s_waitcnt vmcnt(0)
9356 - Must happen before
9357 any following volatile
9368 load *none* *none* - local 1. ds_load
9369 store *none* *none* - global - !volatile & !nontemporal
9371 - private 1. buffer/global/flat_store
9373 - !volatile & nontemporal
9375 1. buffer/global/flat_store
9380 1. buffer/global/flat_store
9382 2. s_waitcnt vmcnt(0)
9384 - Must happen before
9385 any following volatile
9396 store *none* *none* - local 1. ds_store
9397 **Unordered Atomic**
9398 ------------------------------------------------------------------------------------
9399 load atomic unordered *any* *any* *Same as non-atomic*.
9400 store atomic unordered *any* *any* *Same as non-atomic*.
9401 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
9402 **Monotonic Atomic**
9403 ------------------------------------------------------------------------------------
9404 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
9405 - wavefront - generic
9406 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
9408 load atomic monotonic - singlethread - local *If TgSplit execution mode,
9409 - wavefront local address space cannot
9410 - workgroup be used.*
9413 load atomic monotonic - agent - global 1. buffer/global/flat_load
9415 load atomic monotonic - system - global 1. buffer/global/flat_load
9416 - generic sc0=1 sc1=1
9417 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
9418 - wavefront - generic
9419 store atomic monotonic - workgroup - global 1. buffer/global/flat_store
9421 store atomic monotonic - agent - global 1. buffer/global/flat_store
9423 store atomic monotonic - system - global 1. buffer/global/flat_store
9424 - generic sc0=1 sc1=1
9425 store atomic monotonic - singlethread - local *If TgSplit execution mode,
9426 - wavefront local address space cannot
9427 - workgroup be used.*
9430 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
9431 - wavefront - generic
9434 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
9436 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
9437 - wavefront local address space cannot
9438 - workgroup be used.*
9442 ------------------------------------------------------------------------------------
9443 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
9446 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1
9447 2. s_waitcnt vmcnt(0)
9449 - If not TgSplit execution
9451 - Must happen before the
9452 following buffer_inv.
9456 - If not TgSplit execution
9458 - Must happen before
9469 load atomic acquire - workgroup - local *If TgSplit execution mode,
9470 local address space cannot
9474 2. s_waitcnt lgkmcnt(0)
9477 - Must happen before
9486 older than the local load
9490 load atomic acquire - workgroup - generic 1. flat_load sc0=1
9491 2. s_waitcnt lgkm/vmcnt(0)
9493 - Use lgkmcnt(0) if not
9494 TgSplit execution mode
9495 and vmcnt(0) if TgSplit
9497 - If OpenCL, omit lgkmcnt(0).
9498 - Must happen before
9501 following global/generic
9508 older than a local load
9514 - If not TgSplit execution
9521 load atomic acquire - agent - global 1. buffer/global_load
9523 2. s_waitcnt vmcnt(0)
9525 - Must happen before
9535 - Must happen before
9545 load atomic acquire - system - global 1. buffer/global/flat_load
9547 2. s_waitcnt vmcnt(0)
9549 - Must happen before
9557 3. buffer_inv sc0=1 sc1=1
9559 - Must happen before
9567 stale MTYPE NC global data.
9568 MTYPE RW and CC memory will
9569 never be stale due to the
9572 load atomic acquire - agent - generic 1. flat_load sc1=1
9573 2. s_waitcnt vmcnt(0) &
9576 - If TgSplit execution mode,
9580 - Must happen before
9583 - Ensures the flat_load
9590 - Must happen before
9600 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1
9601 2. s_waitcnt vmcnt(0) &
9604 - If TgSplit execution mode,
9608 - Must happen before
9611 - Ensures the flat_load
9616 3. buffer_inv sc0=1 sc1=1
9618 - Must happen before
9626 stale MTYPE NC global data.
9627 MTYPE RW and CC memory will
9628 never be stale due to the
9631 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
9632 - wavefront - generic
9633 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
9634 - wavefront local address space cannot
9638 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
9639 2. s_waitcnt vmcnt(0)
9641 - If not TgSplit execution
9643 - Must happen before the
9644 following buffer_inv.
9645 - Ensures the atomicrmw
9652 - If not TgSplit execution
9654 - Must happen before
9664 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
9665 local address space cannot
9669 2. s_waitcnt lgkmcnt(0)
9672 - Must happen before
9681 older than the local
9685 atomicrmw acquire - workgroup - generic 1. flat_atomic
9686 2. s_waitcnt lgkm/vmcnt(0)
9688 - Use lgkmcnt(0) if not
9689 TgSplit execution mode
9690 and vmcnt(0) if TgSplit
9692 - If OpenCL, omit lgkmcnt(0).
9693 - Must happen before
9710 - If not TgSplit execution
9717 atomicrmw acquire - agent - global 1. buffer/global_atomic
9718 2. s_waitcnt vmcnt(0)
9720 - Must happen before
9731 - Must happen before
9741 atomicrmw acquire - system - global 1. buffer/global_atomic
9743 2. s_waitcnt vmcnt(0)
9745 - Must happen before
9754 3. buffer_inv sc0=1 sc1=1
9756 - Must happen before
9764 stale MTYPE NC global data.
9765 MTYPE RW and CC memory will
9766 never be stale due to the
9769 atomicrmw acquire - agent - generic 1. flat_atomic
9770 2. s_waitcnt vmcnt(0) &
9773 - If TgSplit execution mode,
9777 - Must happen before
9788 - Must happen before
9798 atomicrmw acquire - system - generic 1. flat_atomic sc1=1
9799 2. s_waitcnt vmcnt(0) &
9802 - If TgSplit execution mode,
9806 - Must happen before
9815 3. buffer_inv sc0=1 sc1=1
9817 - Must happen before
9825 stale MTYPE NC global data.
9826 MTYPE RW and CC memory will
9827 never be stale due to the
9830 fence acquire - singlethread *none* *none*
9832 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
9834 - Use lgkmcnt(0) if not
9835 TgSplit execution mode
9836 and vmcnt(0) if TgSplit
9846 - However, since LLVM
9861 - s_waitcnt vmcnt(0)
9873 fence-paired-atomic).
9874 - s_waitcnt lgkmcnt(0)
9885 fence-paired-atomic).
9886 - Must happen before
9899 fence-paired-atomic.
9903 - If not TgSplit execution
9910 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
9913 - If TgSplit execution mode,
9919 - However, since LLVM
9927 - Could be split into
9936 - s_waitcnt vmcnt(0)
9947 fence-paired-atomic).
9948 - s_waitcnt lgkmcnt(0)
9959 fence-paired-atomic).
9960 - Must happen before
9974 fence-paired-atomic.
9978 - Must happen before any
9979 following global/generic
9988 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
9991 - If TgSplit execution mode,
9997 - However, since LLVM
10005 - Could be split into
10009 lgkmcnt(0) to allow
10011 independently moved
10014 - s_waitcnt vmcnt(0)
10017 global/generic load
10021 and memory ordering
10025 fence-paired-atomic).
10026 - s_waitcnt lgkmcnt(0)
10033 and memory ordering
10037 fence-paired-atomic).
10038 - Must happen before
10042 fence-paired atomic
10044 before invalidating
10048 locations read must
10052 fence-paired-atomic.
10054 2. buffer_inv sc0=1 sc1=1
10056 - Must happen before any
10057 following global/generic
10067 ------------------------------------------------------------------------------------
10068 store atomic release - singlethread - global 1. buffer/global/flat_store
10069 - wavefront - generic
10070 store atomic release - singlethread - local *If TgSplit execution mode,
10071 - wavefront local address space cannot
10075 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10077 - Use lgkmcnt(0) if not
10078 TgSplit execution mode
10079 and vmcnt(0) if TgSplit
10081 - If OpenCL, omit lgkmcnt(0).
10082 - s_waitcnt vmcnt(0)
10085 global/generic load/store/
10086 load atomic/store atomic/
10088 - s_waitcnt lgkmcnt(0)
10095 - Must happen before
10103 store that is being
10106 2. buffer/global/flat_store sc0=1
10107 store atomic release - workgroup - local *If TgSplit execution mode,
10108 local address space cannot
10112 store atomic release - agent - global 1. buffer_wbl2 sc1=1
10114 - Must happen before
10115 following s_waitcnt.
10116 - Performs L2 writeback to
10119 store/atomicrmw are
10120 visible at agent scope.
10122 2. s_waitcnt lgkmcnt(0) &
10125 - If TgSplit execution mode,
10131 - Could be split into
10135 lgkmcnt(0) to allow
10137 independently moved
10140 - s_waitcnt vmcnt(0)
10147 - s_waitcnt lgkmcnt(0)
10154 - Must happen before
10162 store that is being
10165 3. buffer/global/flat_store sc1=1
10166 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10168 - Must happen before
10169 following s_waitcnt.
10170 - Performs L2 writeback to
10173 store/atomicrmw are
10174 visible at system scope.
10176 2. s_waitcnt lgkmcnt(0) &
10179 - If TgSplit execution mode,
10185 - Could be split into
10189 lgkmcnt(0) to allow
10191 independently moved
10194 - s_waitcnt vmcnt(0)
10195 must happen after any
10201 - s_waitcnt lgkmcnt(0)
10202 must happen after any
10208 - Must happen before
10213 to memory and the L2
10217 store that is being
10220 3. buffer/global/flat_store
10222 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
10223 - wavefront - generic
10224 atomicrmw release - singlethread - local *If TgSplit execution mode,
10225 - wavefront local address space cannot
10229 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10231 - Use lgkmcnt(0) if not
10232 TgSplit execution mode
10233 and vmcnt(0) if TgSplit
10237 - s_waitcnt vmcnt(0)
10240 global/generic load/store/
10241 load atomic/store atomic/
10243 - s_waitcnt lgkmcnt(0)
10250 - Must happen before
10261 2. buffer/global/flat_atomic sc0=1
10262 atomicrmw release - workgroup - local *If TgSplit execution mode,
10263 local address space cannot
10267 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1
10269 - Must happen before
10270 following s_waitcnt.
10271 - Performs L2 writeback to
10274 store/atomicrmw are
10275 visible at agent scope.
10277 2. s_waitcnt lgkmcnt(0) &
10280 - If TgSplit execution mode,
10284 - Could be split into
10288 lgkmcnt(0) to allow
10290 independently moved
10293 - s_waitcnt vmcnt(0)
10300 - s_waitcnt lgkmcnt(0)
10307 - Must happen before
10312 to global and local
10318 3. buffer/global/flat_atomic sc1=1
10319 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1
10321 - Must happen before
10322 following s_waitcnt.
10323 - Performs L2 writeback to
10326 store/atomicrmw are
10327 visible at system scope.
10329 2. s_waitcnt lgkmcnt(0) &
10332 - If TgSplit execution mode,
10336 - Could be split into
10340 lgkmcnt(0) to allow
10342 independently moved
10345 - s_waitcnt vmcnt(0)
10352 - s_waitcnt lgkmcnt(0)
10359 - Must happen before
10364 to memory and the L2
10368 store that is being
10371 3. buffer/global/flat_atomic
10373 fence release - singlethread *none* *none*
10375 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10377 - Use lgkmcnt(0) if not
10378 TgSplit execution mode
10379 and vmcnt(0) if TgSplit
10389 - However, since LLVM
10394 always generate. If
10404 - s_waitcnt vmcnt(0)
10409 load atomic/store atomic/
10411 - s_waitcnt lgkmcnt(0)
10418 - Must happen before
10419 any following store
10423 and memory ordering
10427 fence-paired-atomic).
10434 fence-paired-atomic.
10436 fence release - agent *none* 1. buffer_wbl2 sc1=1
10441 - Must happen before
10442 following s_waitcnt.
10443 - Performs L2 writeback to
10446 store/atomicrmw are
10447 visible at agent scope.
10449 2. s_waitcnt lgkmcnt(0) &
10452 - If TgSplit execution mode,
10462 - However, since LLVM
10467 always generate. If
10477 - Could be split into
10481 lgkmcnt(0) to allow
10483 independently moved
10486 - s_waitcnt vmcnt(0)
10493 - s_waitcnt lgkmcnt(0)
10500 - Must happen before
10501 any following store
10505 and memory ordering
10509 fence-paired-atomic).
10516 fence-paired-atomic.
10518 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10520 - Must happen before
10521 following s_waitcnt.
10522 - Performs L2 writeback to
10525 store/atomicrmw are
10526 visible at system scope.
10528 2. s_waitcnt lgkmcnt(0) &
10531 - If TgSplit execution mode,
10541 - However, since LLVM
10546 always generate. If
10556 - Could be split into
10560 lgkmcnt(0) to allow
10562 independently moved
10565 - s_waitcnt vmcnt(0)
10572 - s_waitcnt lgkmcnt(0)
10579 - Must happen before
10580 any following store
10584 and memory ordering
10588 fence-paired-atomic).
10595 fence-paired-atomic.
10597 **Acquire-Release Atomic**
10598 ------------------------------------------------------------------------------------
10599 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
10600 - wavefront - generic
10601 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
10602 - wavefront local address space cannot
10606 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10608 - Use lgkmcnt(0) if not
10609 TgSplit execution mode
10610 and vmcnt(0) if TgSplit
10614 - Must happen after
10620 - s_waitcnt vmcnt(0)
10623 global/generic load/store/
10624 load atomic/store atomic/
10626 - s_waitcnt lgkmcnt(0)
10633 - Must happen before
10644 2. buffer/global_atomic
10645 3. s_waitcnt vmcnt(0)
10647 - If not TgSplit execution
10649 - Must happen before
10659 4. buffer_inv sc0=1
10661 - If not TgSplit execution
10668 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
10669 local address space cannot
10673 2. s_waitcnt lgkmcnt(0)
10676 - Must happen before
10685 older than the local load
10689 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
10691 - Use lgkmcnt(0) if not
10692 TgSplit execution mode
10693 and vmcnt(0) if TgSplit
10697 - s_waitcnt vmcnt(0)
10700 global/generic load/store/
10701 load atomic/store atomic/
10703 - s_waitcnt lgkmcnt(0)
10710 - Must happen before
10722 3. s_waitcnt lgkmcnt(0) &
10725 - If not TgSplit execution
10726 mode, omit vmcnt(0).
10729 - Must happen before
10740 older than a local load
10744 3. buffer_inv sc0=1
10746 - If not TgSplit execution
10753 atomicrmw acq_rel - agent - global 1. buffer_wbl2 sc1=1
10755 - Must happen before
10756 following s_waitcnt.
10757 - Performs L2 writeback to
10760 store/atomicrmw are
10761 visible at agent scope.
10763 2. s_waitcnt lgkmcnt(0) &
10766 - If TgSplit execution mode,
10770 - Could be split into
10774 lgkmcnt(0) to allow
10776 independently moved
10779 - s_waitcnt vmcnt(0)
10786 - s_waitcnt lgkmcnt(0)
10793 - Must happen before
10804 3. buffer/global_atomic
10805 4. s_waitcnt vmcnt(0)
10807 - Must happen before
10816 5. buffer_inv sc1=1
10818 - Must happen before
10828 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1
10830 - Must happen before
10831 following s_waitcnt.
10832 - Performs L2 writeback to
10835 store/atomicrmw are
10836 visible at system scope.
10838 2. s_waitcnt lgkmcnt(0) &
10841 - If TgSplit execution mode,
10845 - Could be split into
10849 lgkmcnt(0) to allow
10851 independently moved
10854 - s_waitcnt vmcnt(0)
10861 - s_waitcnt lgkmcnt(0)
10868 - Must happen before
10873 to global and L2 writeback
10874 have completed before
10879 3. buffer/global_atomic
10881 4. s_waitcnt vmcnt(0)
10883 - Must happen before
10892 5. buffer_inv sc0=1 sc1=1
10894 - Must happen before
10902 MTYPE NC global data.
10903 MTYPE RW and CC memory will
10904 never be stale due to the
10907 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1
10909 - Must happen before
10910 following s_waitcnt.
10911 - Performs L2 writeback to
10914 store/atomicrmw are
10915 visible at agent scope.
10917 2. s_waitcnt lgkmcnt(0) &
10920 - If TgSplit execution mode,
10924 - Could be split into
10928 lgkmcnt(0) to allow
10930 independently moved
10933 - s_waitcnt vmcnt(0)
10940 - s_waitcnt lgkmcnt(0)
10947 - Must happen before
10959 4. s_waitcnt vmcnt(0) &
10962 - If TgSplit execution mode,
10966 - Must happen before
10975 5. buffer_inv sc1=1
10977 - Must happen before
10987 atomicrmw acq_rel - system - generic 1. buffer_wbl2 sc0=1 sc1=1
10989 - Must happen before
10990 following s_waitcnt.
10991 - Performs L2 writeback to
10994 store/atomicrmw are
10995 visible at system scope.
10997 2. s_waitcnt lgkmcnt(0) &
11000 - If TgSplit execution mode,
11004 - Could be split into
11008 lgkmcnt(0) to allow
11010 independently moved
11013 - s_waitcnt vmcnt(0)
11020 - s_waitcnt lgkmcnt(0)
11027 - Must happen before
11032 to global and L2 writeback
11033 have completed before
11038 3. flat_atomic sc1=1
11039 4. s_waitcnt vmcnt(0) &
11042 - If TgSplit execution mode,
11046 - Must happen before
11055 5. buffer_inv sc0=1 sc1=1
11057 - Must happen before
11065 MTYPE NC global data.
11066 MTYPE RW and CC memory will
11067 never be stale due to the
11070 fence acq_rel - singlethread *none* *none*
11072 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
11074 - Use lgkmcnt(0) if not
11075 TgSplit execution mode
11076 and vmcnt(0) if TgSplit
11095 - s_waitcnt vmcnt(0)
11100 load atomic/store atomic/
11102 - s_waitcnt lgkmcnt(0)
11109 - Must happen before
11128 and memory ordering
11132 acquire-fence-paired-atomic)
11145 local/generic store
11149 and memory ordering
11153 release-fence-paired-atomic).
11157 - Must happen before
11161 acquire-fence-paired
11162 atomic has completed
11163 before invalidating
11167 locations read must
11171 acquire-fence-paired-atomic.
11173 3. buffer_inv sc0=1
11175 - If not TgSplit execution
11182 fence acq_rel - agent *none* 1. buffer_wbl2 sc1=1
11187 - Must happen before
11188 following s_waitcnt.
11189 - Performs L2 writeback to
11192 store/atomicrmw are
11193 visible at agent scope.
11195 2. s_waitcnt lgkmcnt(0) &
11198 - If TgSplit execution mode,
11204 - However, since LLVM
11212 - Could be split into
11216 lgkmcnt(0) to allow
11218 independently moved
11221 - s_waitcnt vmcnt(0)
11228 - s_waitcnt lgkmcnt(0)
11235 - Must happen before
11240 global/local/generic
11245 and memory ordering
11249 acquire-fence-paired-atomic)
11251 before invalidating
11261 global/local/generic
11266 and memory ordering
11270 release-fence-paired-atomic).
11275 3. buffer_inv sc1=1
11277 - Must happen before
11291 fence acq_rel - system *none* 1. buffer_wbl2 sc0=1 sc1=1
11296 - Must happen before
11297 following s_waitcnt.
11298 - Performs L2 writeback to
11301 store/atomicrmw are
11302 visible at system scope.
11304 1. s_waitcnt lgkmcnt(0) &
11307 - If TgSplit execution mode,
11313 - However, since LLVM
11321 - Could be split into
11325 lgkmcnt(0) to allow
11327 independently moved
11330 - s_waitcnt vmcnt(0)
11337 - s_waitcnt lgkmcnt(0)
11344 - Must happen before
11349 global/local/generic
11354 and memory ordering
11358 acquire-fence-paired-atomic)
11360 before invalidating
11370 global/local/generic
11375 and memory ordering
11379 release-fence-paired-atomic).
11384 2. buffer_inv sc0=1 sc1=1
11386 - Must happen before
11395 MTYPE NC global data.
11396 MTYPE RW and CC memory will
11397 never be stale due to the
11400 **Sequential Consistent Atomic**
11401 ------------------------------------------------------------------------------------
11402 load atomic seq_cst - singlethread - global *Same as corresponding
11403 - wavefront - local load atomic acquire,
11404 - generic except must generate
11405 all instructions even
11407 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
11409 - Use lgkmcnt(0) if not
11410 TgSplit execution mode
11411 and vmcnt(0) if TgSplit
11413 - s_waitcnt lgkmcnt(0) must
11420 ordering of seq_cst
11426 lgkmcnt(0) and so do
11429 - s_waitcnt vmcnt(0)
11432 global/generic load
11436 ordering of seq_cst
11448 consistent global/local
11449 memory instructions
11455 prevents reordering
11458 seq_cst load. (Note
11464 followed by a store
11471 release followed by
11474 order. The s_waitcnt
11475 could be placed after
11476 seq_store or before
11479 make the s_waitcnt be
11480 as late as possible
11486 instructions same as
11489 except must generate
11490 all instructions even
11492 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
11493 local address space cannot
11496 *Same as corresponding
11497 load atomic acquire,
11498 except must generate
11499 all instructions even
11502 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
11503 - system - generic vmcnt(0)
11505 - If TgSplit execution mode,
11507 - Could be split into
11511 lgkmcnt(0) to allow
11513 independently moved
11516 - s_waitcnt lgkmcnt(0)
11519 global/generic load
11523 ordering of seq_cst
11529 lgkmcnt(0) and so do
11532 - s_waitcnt vmcnt(0)
11535 global/generic load
11539 ordering of seq_cst
11552 memory instructions
11558 prevents reordering
11561 seq_cst load. (Note
11567 followed by a store
11574 release followed by
11577 order. The s_waitcnt
11578 could be placed after
11579 seq_store or before
11582 make the s_waitcnt be
11583 as late as possible
11589 instructions same as
11592 except must generate
11593 all instructions even
11595 store atomic seq_cst - singlethread - global *Same as corresponding
11596 - wavefront - local store atomic release,
11597 - workgroup - generic except must generate
11598 - agent all instructions even
11599 - system for OpenCL.*
11600 atomicrmw seq_cst - singlethread - global *Same as corresponding
11601 - wavefront - local atomicrmw acq_rel,
11602 - workgroup - generic except must generate
11603 - agent all instructions even
11604 - system for OpenCL.*
11605 fence seq_cst - singlethread *none* *Same as corresponding
11606 - wavefront fence acq_rel,
11607 - workgroup except must generate
11608 - agent all instructions even
11609 - system for OpenCL.*
11610 ============ ============ ============== ========== ================================
11612 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11614 Memory Model GFX10-GFX11
11615 ++++++++++++++++++++++++

For GFX10-GFX11:

11619 * Each agent has multiple shader arrays (SA).
11620 * Each SA has multiple work-group processors (WGP).
11621 * Each WGP has multiple compute units (CU).
11622 * Each CU has multiple SIMDs that execute wavefronts.
11623 * The wavefronts for a single work-group are executed in the same
11624 WGP. In CU wavefront execution mode the wavefronts may be executed by
11625 different SIMDs in the same CU. In WGP wavefront execution mode the
11626 wavefronts may be executed by different SIMDs in different CUs in the same WGP.
11628 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups executing on it.
11630 * All LDS operations of a WGP are performed as wavefront wide operations in a
11631 global order and involve no caching. Completion is reported to a wavefront in execution order.
11633 * The LDS memory has multiple request queues shared by the SIMDs of a
11634 WGP. Therefore, the LDS operations performed by different wavefronts of a
11635 work-group can be reordered relative to each other, which can result in
11636 reordering the visibility of vector memory operations with respect to LDS
11637 operations of other wavefronts in the same work-group. A ``s_waitcnt
11638 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11639 vector memory operations between wavefronts of a work-group, but not between
11640 operations performed by the same wavefront.
11641 * The vector memory operations are performed as wavefront wide operations.
11642 Completion of load/store/sample operations is reported to a wavefront in
11643 execution order of other load/store/sample operations performed by that wavefront.
11645 * The vector memory operations access a vector L0 cache. There is a single L0
11646 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11647 special action is required for coherence between the lanes of a single
11648 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11649 wavefronts executing in the same work-group as they may be executing on SIMDs
11650 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11651 required for coherence between wavefronts executing in different work-groups
11652 as they may be executing on different WGPs.
11653 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
11654 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11655 operations are used in a restricted way so do not impact the memory model. See
11656 :ref:`amdgpu-amdhsa-memory-spaces`.
11657 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11658 the same SA. Therefore, no special action is required for coherence between
11659 the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11660 required for coherence between wavefronts executing in different work-groups
11661 as they may be executing on different SAs that access different L1s.
11662 * The L1 caches have independent quadrants to service disjoint ranges of virtual addresses.
11664 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11665 vector and scalar memory operations performed by different wavefronts, whether
11666 executing in the same or different work-groups (which may be executing on
11667 different CUs accessing different L0s), can be reordered relative to each
11668 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11669 synchronization between vector memory operations of different wavefronts. It
11670 ensures a previous vector memory operation has completed before executing a
11671 subsequent vector memory or LDS operation and so can be used to meet the
11672 requirements of acquire, release and sequential consistency.
11673 * The L1 caches use an L2 cache shared by all SAs on the same agent.
11674 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
11676 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11677 quadrant has a separate request queue per L2 channel. Therefore, the vector
11678 and scalar memory operations performed by wavefronts executing in different
11679 work-groups (which may be executing on different SAs) of an agent can be
11680 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11681 required to ensure synchronization between vector memory operations of
11682 different SAs. It ensures a previous vector memory operation has completed
11683 before executing a subsequent vector memory operation and so can be used to meet the
11684 requirements of acquire, release and sequential consistency.
11685 * The L2 cache can be kept coherent with other agents on some targets, or ranges
11686 of virtual addresses can be set up to bypass it to ensure system coherence.
11687 * On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
11688 The MALL cache is fully coherent with GPU memory and has no impact on system
11689 coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
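
A minimal, hypothetical sketch (function name illustrative, sequences
paraphrased rather than verbatim compiler output) of how the invalidates
described in the list above pair with LLVM fence scopes:

.. code-block:: llvm

   ; Hypothetical example relating fence scopes to the cache hierarchy above.
   define void @gfx10_fences() {
   entry:
     ; Workgroup scope: in WGP wavefront execution mode the wavefronts of a
     ; work-group may run on different CUs, so the per-CU L0 must be
     ; invalidated (buffer_gl0_inv) after the required waits.
     fence syncscope("workgroup") acquire
     ; Agent scope: other work-groups may run on other SAs, so the shared L1
     ; must also be invalidated (buffer_gl1_inv).
     fence syncscope("agent") acquire
     ret void
   }
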
11691 Scalar memory operations are only used to access memory that is proven to not
11692 change during the execution of the kernel dispatch. This includes constant
11693 address space and global address space for program scope ``const`` variables.
11694 Therefore, the kernel machine code does not have to maintain the scalar cache to
11695 ensure it is coherent with the vector caches. The scalar and vector caches are
11696 invalidated between kernel dispatches by CP since constant address space data
11697 may change between kernel dispatch executions. See
11698 :ref:`amdgpu-amdhsa-memory-spaces`.
11700 The one exception is if scalar writes are used to spill SGPR registers. In this
11701 case the AMDGPU backend ensures the memory location used to spill is never
11702 accessed by vector memory operations at the same time. If scalar writes are used
11703 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11704 return since the locations may be used for vector memory instructions by a
11705 future wavefront that uses the same scratch area, or a function call that
11706 creates a frame at the same address, respectively. There is no need for a
11707 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
11709 For kernarg backing memory:
11711 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11712 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11713 needing to invalidate the L2 cache.
11714 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11715 so the L2 cache will be coherent with the CPU and other agents.
11717 Scratch backing memory (which is used for the private address space) is accessed
11718 with MTYPE NC (non-coherent). Since the private address space is only accessed
11719 by a single thread, and is always write-before-read, there is never a need to
11720 invalidate these entries from the L0 or L1 caches.
11722 Wavefronts are executed in native mode with in-order reporting of loads and
11723 sample instructions. In this mode vmcnt reports completion of load, atomic with
11724 return and sample instructions in order, and the vscnt reports the completion of
11725 store and atomic without return in order. See ``MEM_ORDERED`` field in
11726 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
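
A hypothetical sketch of why the returning/non-returning distinction matters
when waits are inserted (names are illustrative; the exact sequences are given
in the table below):

.. code-block:: llvm

   ; Hypothetical example: the same workgroup-scope acquire atomicrmw with and
   ; without a used result. Completion of the returning form is tracked by
   ; vmcnt and the non-returning form by vscnt, so the wait needed to honor
   ; the acquire differs accordingly.
   define void @rmw_forms(ptr addrspace(1) %p, ptr addrspace(1) %out) {
   entry:
     ; Result unused: may be selected as a no-return atomic and waited on
     ; with s_waitcnt vscnt(0).
     %unused = atomicrmw add ptr addrspace(1) %p, i32 1 syncscope("workgroup") acquire
     ; Result used: a returning atomic, waited on with s_waitcnt vmcnt(0).
     %old = atomicrmw add ptr addrspace(1) %p, i32 1 syncscope("workgroup") acquire
     store i32 %old, ptr addrspace(1) %out, align 4
     ret void
   }
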
11728 Wavefronts can be executed in WGP or CU wavefront execution mode:
11730 * In WGP wavefront execution mode the wavefronts of a work-group are executed
11731 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
11732 CU L0 caches is required for work-group synchronization. Also accesses to L1
11733 at work-group scope need to be explicitly ordered as the accesses from
11734 different CUs are not ordered.
11735 * In CU wavefront execution mode the wavefronts of a work-group are executed on
11736 the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
11737 the work-group access the same L0, which in turn ensures L1 accesses are
11738 ordered and so do not require explicit management of the caches for
11739 work-group synchronization.
11741 See ``WGP_MODE`` field in
11742 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
11743 :ref:`amdgpu-target-features`.
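
A hypothetical IR-level sketch of selecting the execution mode, assuming the
``cumode`` target feature and a GFX10 processor (the function name and
attribute values are illustrative only):

.. code-block:: llvm

   ; Hypothetical example: requesting CU wavefront execution mode for one
   ; function via target features; "-cumode" requests WGP wavefront execution
   ; mode instead.
   define amdgpu_kernel void @cu_mode_kernel() #0 {
   entry:
     ret void
   }

   attributes #0 = { "target-cpu"="gfx1030" "target-features"="+cumode" }
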
11745 The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11746 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
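
As a worked example of reading the table, the following hypothetical
workgroup-scope acquire load is shown with an approximate paraphrase of its
lowering (not verbatim compiler output):

.. code-block:: llvm

   ; Hypothetical example: per the "load atomic acquire - workgroup - global"
   ; row, this lowers to roughly buffer/global_load glc=1, s_waitcnt vmcnt(0),
   ; then buffer_gl0_inv, with parts omitted in CU wavefront execution mode.
   define i32 @wg_acquire_load(ptr addrspace(1) %p) {
   entry:
     %v = load atomic i32, ptr addrspace(1) %p syncscope("workgroup") acquire, align 4
     ret i32 %v
   }
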
11748 .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11749 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11751 ============ ============ ============== ========== ================================
11752 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
11753 Ordering Sync Scope Address GFX10-GFX11
11755 ============ ============ ============== ========== ================================
11757 ------------------------------------------------------------------------------------
11758 load *none* *none* - global - !volatile & !nontemporal
11760 - private 1. buffer/global/flat_load
11762 - !volatile & nontemporal
11764 1. buffer/global/flat_load
11767 - If GFX10, omit dlc=1.
11771 1. buffer/global/flat_load
11774 2. s_waitcnt vmcnt(0)
11776 - Must happen before
11777 any following volatile
11788 load *none* *none* - local 1. ds_load
11789 store *none* *none* - global - !volatile & !nontemporal
11791 - private 1. buffer/global/flat_store
11793 - !volatile & nontemporal
11795 1. buffer/global/flat_store
11798 - If GFX10, omit dlc=1.
11802 1. buffer/global/flat_store
11805 - If GFX10, omit dlc=1.
11807 2. s_waitcnt vscnt(0)
11809 - Must happen before
11810 any following volatile
11821 store *none* *none* - local 1. ds_store
11822 **Unordered Atomic**
11823 ------------------------------------------------------------------------------------
11824 load atomic unordered *any* *any* *Same as non-atomic*.
11825 store atomic unordered *any* *any* *Same as non-atomic*.
11826 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
11827 **Monotonic Atomic**
11828 ------------------------------------------------------------------------------------
11829 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
11830 - wavefront - generic
11831 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
11834 - If CU wavefront execution
11837 load atomic monotonic - singlethread - local 1. ds_load
11840 load atomic monotonic - agent - global 1. buffer/global/flat_load
11841 - system - generic glc=1 dlc=1
11843 - If GFX11, omit dlc=1.
11845 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
11846 - wavefront - generic
11850 store atomic monotonic - singlethread - local 1. ds_store
11853 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
11854 - wavefront - generic
11858 atomicrmw monotonic - singlethread - local 1. ds_atomic
11862 ------------------------------------------------------------------------------------
11863 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
11864 - wavefront - local
11866 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
11868 - If CU wavefront execution
11871 2. s_waitcnt vmcnt(0)
11873 - If CU wavefront execution
11875 - Must happen before
11876 the following buffer_gl0_inv
11877 and before any following
11885 - If CU wavefront execution
11892 load atomic acquire - workgroup - local 1. ds_load
11893 2. s_waitcnt lgkmcnt(0)
11896 - Must happen before
11897 the following buffer_gl0_inv
11898 and before any following
11899 global/generic load/load
11905 older than the local load
11911 - If CU wavefront execution
11919 load atomic acquire - workgroup - generic 1. flat_load glc=1
11921 - If CU wavefront execution
11924 2. s_waitcnt lgkmcnt(0) &
11927 - If CU wavefront execution
11928 mode, omit vmcnt(0).
11931 - Must happen before
11933 buffer_gl0_inv and any
11934 following global/generic
11941 older than a local load
11947 - If CU wavefront execution
11954 load atomic acquire - agent - global 1. buffer/global_load
11955 - system glc=1 dlc=1
11957 - If GFX11, omit dlc=1.
11959 2. s_waitcnt vmcnt(0)
11961 - Must happen before
11966 before invalidating
11972 - Must happen before
11982 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
11984 - If GFX11, omit dlc=1.
11986 2. s_waitcnt vmcnt(0) &
11991 - Must happen before
11994 - Ensures the flat_load
11996 before invalidating
12002 - Must happen before
12012 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
12013 - wavefront - local
12015 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
12016 2. s_waitcnt vm/vscnt(0)
12018 - If CU wavefront execution
12020 - Use vmcnt(0) if atomic with
12021 return and vscnt(0) if
12022 atomic with no-return.
12023 - Must happen before
12024 the following buffer_gl0_inv
12025 and before any following
12033 - If CU wavefront execution
12040 atomicrmw acquire - workgroup - local 1. ds_atomic
12041 2. s_waitcnt lgkmcnt(0)
12044 - Must happen before
12050 older than the local
12062 atomicrmw acquire - workgroup - generic 1. flat_atomic
12063 2. s_waitcnt lgkmcnt(0) &
12066 - If CU wavefront execution
12067 mode, omit vm/vscnt(0).
12068 - If OpenCL, omit lgkmcnt(0).
12069 - Use vmcnt(0) if atomic with
12070 return and vscnt(0) if
12071 atomic with no-return.
12072 - Must happen before
12084 - If CU wavefront execution
12091 atomicrmw acquire - agent - global 1. buffer/global_atomic
12092 - system 2. s_waitcnt vm/vscnt(0)
12094 - Use vmcnt(0) if atomic with
12095 return and vscnt(0) if
12096 atomic with no-return.
12097 - Must happen before
12109 - Must happen before
12119 atomicrmw acquire - agent - generic 1. flat_atomic
12120 - system 2. s_waitcnt vm/vscnt(0) &
12125 - Use vmcnt(0) if atomic with
12126 return and vscnt(0) if
12127 atomic with no-return.
12128 - Must happen before
12140 - Must happen before
12150 fence acquire - singlethread *none* *none*
12152 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12153 vmcnt(0) & vscnt(0)
12155 - If CU wavefront execution
12156 mode, omit vmcnt(0) and
12165 vmcnt(0) and vscnt(0).
12166 - However, since LLVM
12171 always generate. If
12181 - Could be split into
12183 vmcnt(0), s_waitcnt
12184 vscnt(0) and s_waitcnt
12185 lgkmcnt(0) to allow
12187 independently moved
12190 - s_waitcnt vmcnt(0)
12193 global/generic load
12195 atomicrmw-with-return-value
12198 and memory ordering
12202 fence-paired-atomic).
12203 - s_waitcnt vscnt(0)
12207 atomicrmw-no-return-value
12210 and memory ordering
12214 fence-paired-atomic).
12215 - s_waitcnt lgkmcnt(0)
12222 and memory ordering
12226 fence-paired-atomic).
12227 - Must happen before
12231 fence-paired atomic
12233 before invalidating
12237 locations read must
12241 fence-paired-atomic.
12245 - If CU wavefront execution
12252 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
12253 - system vmcnt(0) & vscnt(0)
12262 vmcnt(0) and vscnt(0).
12263 - However, since LLVM
12271 - Could be split into
12273 vmcnt(0), s_waitcnt
12274 vscnt(0) and s_waitcnt
12275 lgkmcnt(0) to allow
12277 independently moved
12280 - s_waitcnt vmcnt(0)
12283 global/generic load
12285 atomicrmw-with-return-value
12288 and memory ordering
12292 fence-paired-atomic).
12293 - s_waitcnt vscnt(0)
12297 atomicrmw-no-return-value
12300 and memory ordering
12304 fence-paired-atomic).
12305 - s_waitcnt lgkmcnt(0)
12312 and memory ordering
12316 fence-paired-atomic).
12317 - Must happen before
12321 fence-paired atomic
12323 before invalidating
12327 locations read must
12331 fence-paired-atomic.
12336 - Must happen before any
following global/generic
**Release Atomic**
12347 ------------------------------------------------------------------------------------
12348 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
12349 - wavefront - local
12351 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12352 - generic vmcnt(0) & vscnt(0)
12354 - If CU wavefront execution
12355 mode, omit vmcnt(0) and
12359 - Could be split into
12361 vmcnt(0), s_waitcnt
12362 vscnt(0) and s_waitcnt
12363 lgkmcnt(0) to allow
12365 independently moved
12368 - s_waitcnt vmcnt(0)
12371 global/generic load/load
12373 atomicrmw-with-return-value.
12374 - s_waitcnt vscnt(0)
12380 atomicrmw-no-return-value.
12381 - s_waitcnt lgkmcnt(0)
12388 - Must happen before
12396 store that is being
12399 2. buffer/global/flat_store
12400 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12402 - If CU wavefront execution
12405 - Could be split into
12407 vmcnt(0) and s_waitcnt
12410 independently moved
12413 - s_waitcnt vmcnt(0)
12416 global/generic load/load
12418 atomicrmw-with-return-value.
12419 - s_waitcnt vscnt(0)
12423 store/store atomic/
12424 atomicrmw-no-return-value.
12425 - Must happen before
12433 store that is being
12437 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
12438 - system - generic vmcnt(0) & vscnt(0)
12444 - Could be split into
12446 vmcnt(0), s_waitcnt vscnt(0)
12448 lgkmcnt(0) to allow
12450 independently moved
12453 - s_waitcnt vmcnt(0)
12459 atomicrmw-with-return-value.
12460 - s_waitcnt vscnt(0)
12464 store/store atomic/
12465 atomicrmw-no-return-value.
12466 - s_waitcnt lgkmcnt(0)
12473 - Must happen before
12481 store that is being
12484 2. buffer/global/flat_store
12485 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
12486 - wavefront - local
12488 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12489 - generic vmcnt(0) & vscnt(0)
12491 - If CU wavefront execution
12492 mode, omit vmcnt(0) and
12494 - If OpenCL, omit lgkmcnt(0).
12495 - Could be split into
12497 vmcnt(0), s_waitcnt
12498 vscnt(0) and s_waitcnt
12499 lgkmcnt(0) to allow
12501 independently moved
12504 - s_waitcnt vmcnt(0)
12507 global/generic load/load
12509 atomicrmw-with-return-value.
12510 - s_waitcnt vscnt(0)
12516 atomicrmw-no-return-value.
12517 - s_waitcnt lgkmcnt(0)
12524 - Must happen before
12535 2. buffer/global/flat_atomic
12536 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12538 - If CU wavefront execution
12541 - Could be split into
12543 vmcnt(0) and s_waitcnt
12546 independently moved
12549 - s_waitcnt vmcnt(0)
12552 global/generic load/load
12554 atomicrmw-with-return-value.
12555 - s_waitcnt vscnt(0)
12559 store/store atomic/
12560 atomicrmw-no-return-value.
12561 - Must happen before
12569 store that is being
12573 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
12574 - system - generic vmcnt(0) & vscnt(0)
12578 - Could be split into
12580 vmcnt(0), s_waitcnt
12581 vscnt(0) and s_waitcnt
12582 lgkmcnt(0) to allow
12584 independently moved
12587 - s_waitcnt vmcnt(0)
12592 atomicrmw-with-return-value.
12593 - s_waitcnt vscnt(0)
12597 store/store atomic/
12598 atomicrmw-no-return-value.
12599 - s_waitcnt lgkmcnt(0)
12606 - Must happen before
12611 to global and local
12617 2. buffer/global/flat_atomic
12618 fence release - singlethread *none* *none*
12620 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12621 vmcnt(0) & vscnt(0)
12623 - If CU wavefront execution
12624 mode, omit vmcnt(0) and
12633 vmcnt(0) and vscnt(0).
12634 - However, since LLVM
12639 always generate. If
12649 - Could be split into
12651 vmcnt(0), s_waitcnt
12652 vscnt(0) and s_waitcnt
12653 lgkmcnt(0) to allow
12655 independently moved
12658 - s_waitcnt vmcnt(0)
12664 atomicrmw-with-return-value.
12665 - s_waitcnt vscnt(0)
12669 store/store atomic/
12670 atomicrmw-no-return-value.
12671 - s_waitcnt lgkmcnt(0)
12676 atomic/store atomic/
12678 - Must happen before
12679 any following store
12683 and memory ordering
12687 fence-paired-atomic).
12694 fence-paired-atomic.
12696 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
12697 - system vmcnt(0) & vscnt(0)
12706 vmcnt(0) and vscnt(0).
12707 - However, since LLVM
12712 always generate. If
12722 - Could be split into
12724 vmcnt(0), s_waitcnt
12725 vscnt(0) and s_waitcnt
12726 lgkmcnt(0) to allow
12728 independently moved
12731 - s_waitcnt vmcnt(0)
12736 atomicrmw-with-return-value.
12737 - s_waitcnt vscnt(0)
12741 store/store atomic/
12742 atomicrmw-no-return-value.
12743 - s_waitcnt lgkmcnt(0)
12750 - Must happen before
12751 any following store
12755 and memory ordering
12759 fence-paired-atomic).
12766 fence-paired-atomic.
12768 **Acquire-Release Atomic**
12769 ------------------------------------------------------------------------------------
12770 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
12771 - wavefront - local
12773 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12774 vmcnt(0) & vscnt(0)
12776 - If CU wavefront execution
12777 mode, omit vmcnt(0) and
12781 - Must happen after
12787 - Could be split into
12789 vmcnt(0), s_waitcnt
12790 vscnt(0), and s_waitcnt
12791 lgkmcnt(0) to allow
12793 independently moved
12796 - s_waitcnt vmcnt(0)
12799 global/generic load/load
12801 atomicrmw-with-return-value.
12802 - s_waitcnt vscnt(0)
12808 atomicrmw-no-return-value.
12809 - s_waitcnt lgkmcnt(0)
12816 - Must happen before
12827 2. buffer/global_atomic
12828 3. s_waitcnt vm/vscnt(0)
12830 - If CU wavefront execution
12832 - Use vmcnt(0) if atomic with
12833 return and vscnt(0) if
12834 atomic with no-return.
12835 - Must happen before
12847 - If CU wavefront execution
12854 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12856 - If CU wavefront execution
12859 - Could be split into
12861 vmcnt(0) and s_waitcnt
12864 independently moved
12867 - s_waitcnt vmcnt(0)
12870 global/generic load/load
12872 atomicrmw-with-return-value.
12873 - s_waitcnt vscnt(0)
12877 store/store atomic/
12878 atomicrmw-no-return-value.
12879 - Must happen before
12887 store that is being
12891 3. s_waitcnt lgkmcnt(0)
12894 - Must happen before
12900 older than the local load
12906 - If CU wavefront execution
12914 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
12915 vmcnt(0) & vscnt(0)
12917 - If CU wavefront execution
12918 mode, omit vmcnt(0) and
12920 - If OpenCL, omit lgkmcnt(0).
12921 - Could be split into
12923 vmcnt(0), s_waitcnt
12924 vscnt(0) and s_waitcnt
12925 lgkmcnt(0) to allow
12927 independently moved
12930 - s_waitcnt vmcnt(0)
12933 global/generic load/load
12935 atomicrmw-with-return-value.
12936 - s_waitcnt vscnt(0)
12942 atomicrmw-no-return-value.
12943 - s_waitcnt lgkmcnt(0)
12950 - Must happen before
12962 3. s_waitcnt lgkmcnt(0) &
12963 vmcnt(0) & vscnt(0)
12965 - If CU wavefront execution
12966 mode, omit vmcnt(0) and
12968 - If OpenCL, omit lgkmcnt(0).
12969 - Must happen before
12975 older than the load
12981 - If CU wavefront execution
12988 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
12989 - system vmcnt(0) & vscnt(0)
12993 - Could be split into
12995 vmcnt(0), s_waitcnt
12996 vscnt(0) and s_waitcnt
12997 lgkmcnt(0) to allow
12999 independently moved
13002 - s_waitcnt vmcnt(0)
13007 atomicrmw-with-return-value.
13008 - s_waitcnt vscnt(0)
13012 store/store atomic/
13013 atomicrmw-no-return-value.
13014 - s_waitcnt lgkmcnt(0)
13021 - Must happen before
13032 2. buffer/global_atomic
13033 3. s_waitcnt vm/vscnt(0)
13035 - Use vmcnt(0) if atomic with
13036 return and vscnt(0) if
13037 atomic with no-return.
13038 - Must happen before
13050 - Must happen before
13060 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
13061 - system vmcnt(0) & vscnt(0)
13065 - Could be split into
13067 vmcnt(0), s_waitcnt
13068 vscnt(0), and s_waitcnt
13069 lgkmcnt(0) to allow
13071 independently moved
13074 - s_waitcnt vmcnt(0)
13079 atomicrmw-with-return-value.
13080 - s_waitcnt vscnt(0)
13084 store/store atomic/
13085 atomicrmw-no-return-value.
13086 - s_waitcnt lgkmcnt(0)
13093 - Must happen before
13105 3. s_waitcnt vm/vscnt(0) &
13110 - Use vmcnt(0) if atomic with
13111 return and vscnt(0) if
13112 atomic with no-return.
13113 - Must happen before
13125 - Must happen before
13135 fence acq_rel - singlethread *none* *none*
13137 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
13138 vmcnt(0) & vscnt(0)
13140 - If CU wavefront execution
13141 mode, omit vmcnt(0) and
13150 vmcnt(0) and vscnt(0).
13160 - Could be split into
13162 vmcnt(0), s_waitcnt
13163 vscnt(0) and s_waitcnt
13164 lgkmcnt(0) to allow
13166 independently moved
13169 - s_waitcnt vmcnt(0)
13175 atomicrmw-with-return-value.
13176 - s_waitcnt vscnt(0)
13180 store/store atomic/
13181 atomicrmw-no-return-value.
13182 - s_waitcnt lgkmcnt(0)
13187 atomic/store atomic/
13189 - Must happen before
13208 and memory ordering
13212 acquire-fence-paired-atomic)
13225 local/generic store
13229 and memory ordering
13233 release-fence-paired-atomic).
13237 - Must happen before
13241 acquire-fence-paired
13242 atomic has completed
13243 before invalidating
13247 locations read must
13251 acquire-fence-paired-atomic.
13255 - If CU wavefront execution
13262 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
13263 - system vmcnt(0) & vscnt(0)
13272 vmcnt(0) and vscnt(0).
13273 - However, since LLVM
13281 - Could be split into
13283 vmcnt(0), s_waitcnt
13284 vscnt(0) and s_waitcnt
13285 lgkmcnt(0) to allow
13287 independently moved
13290 - s_waitcnt vmcnt(0)
13296 atomicrmw-with-return-value.
13297 - s_waitcnt vscnt(0)
13301 store/store atomic/
13302 atomicrmw-no-return-value.
13303 - s_waitcnt lgkmcnt(0)
13310 - Must happen before
13315 global/local/generic
13320 and memory ordering
13324 acquire-fence-paired-atomic)
13326 before invalidating
13336 global/local/generic
13341 and memory ordering
13345 release-fence-paired-atomic).
13353 - Must happen before
13367 **Sequential Consistent Atomic**
13368 ------------------------------------------------------------------------------------
13369 load atomic seq_cst - singlethread - global *Same as corresponding
13370 - wavefront - local load atomic acquire,
13371 - generic except must generate
13372 all instructions even
13374 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
13375 - generic vmcnt(0) & vscnt(0)
13377 - If CU wavefront execution
13378 mode, omit vmcnt(0) and
13380 - Could be split into
13382 vmcnt(0), s_waitcnt
13383 vscnt(0), and s_waitcnt
13384 lgkmcnt(0) to allow
13386 independently moved
13389 - s_waitcnt lgkmcnt(0) must
13396 ordering of seq_cst
13402 lgkmcnt(0) and so do
13405 - s_waitcnt vmcnt(0)
13408 global/generic load
13410 atomicrmw-with-return-value
13412 ordering of seq_cst
13421 - s_waitcnt vscnt(0)
13424 global/generic store
13426 atomicrmw-no-return-value
13428 ordering of seq_cst
13440 consistent global/local
13441 memory instructions
13447 prevents reordering
13450 seq_cst load. (Note
13456 followed by a store
13463 release followed by
13466 order. The s_waitcnt
13467 could be placed after
13468 seq_store or before
13471 make the s_waitcnt be
13472 as late as possible
13478 instructions same as
13481 except must generate
13482 all instructions even
13484 load atomic seq_cst - workgroup - local
13486 1. s_waitcnt vmcnt(0) & vscnt(0)
13488 - If CU wavefront execution
13490 - Could be split into
13492 vmcnt(0) and s_waitcnt
13495 independently moved
13498 - s_waitcnt vmcnt(0)
13501 global/generic load
13503 atomicrmw-with-return-value
13505 ordering of seq_cst
13514 - s_waitcnt vscnt(0)
13517 global/generic store
13519 atomicrmw-no-return-value
13521 ordering of seq_cst
13534 memory instructions
13540 prevents reordering
13543 seq_cst load. (Note
13549 followed by a store
13556 release followed by
13559 order. The s_waitcnt
13560 could be placed after
13561 seq_store or before
13564 make the s_waitcnt be
13565 as late as possible
13571 instructions same as
13574 except must generate
13575 all instructions even
13578 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
13579 - system - generic vmcnt(0) & vscnt(0)
13581 - Could be split into
13583 vmcnt(0), s_waitcnt
13584 vscnt(0) and s_waitcnt
13585 lgkmcnt(0) to allow
13587 independently moved
13590 - s_waitcnt lgkmcnt(0)
13597 ordering of seq_cst
13603 lgkmcnt(0) and so do
13606 - s_waitcnt vmcnt(0)
13609 global/generic load
13611 atomicrmw-with-return-value
13613 ordering of seq_cst
13622 - s_waitcnt vscnt(0)
13625 global/generic store
13627 atomicrmw-no-return-value
13629 ordering of seq_cst
13642 memory instructions
13648 prevents reordering
13651 seq_cst load. (Note
13657 followed by a store
13664 release followed by
13667 order. The s_waitcnt
13668 could be placed after
13669 seq_store or before
13672 make the s_waitcnt be
13673 as late as possible
13679 instructions same as
13682 except must generate
13683 all instructions even
13685 store atomic seq_cst - singlethread - global *Same as corresponding
13686 - wavefront - local store atomic release,
13687 - workgroup - generic except must generate
13688 - agent all instructions even
13689 - system for OpenCL.*
13690 atomicrmw seq_cst - singlethread - global *Same as corresponding
13691 - wavefront - local atomicrmw acq_rel,
13692 - workgroup - generic except must generate
13693 - agent all instructions even
13694 - system for OpenCL.*
13695 fence seq_cst - singlethread *none* *Same as corresponding
13696 - wavefront fence acq_rel,
13697 - workgroup except must generate
13698 - agent all instructions even
13699 - system for OpenCL.*
13700 ============ ============ ============== ========== ================================
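As a hedged illustration of how these code sequences are consumed, the C11
sketch below performs a source-level acquire load. For an object in the global
address space at agent (device) scope it corresponds to the *load atomic
acquire* rows above: a load with ``glc=1`` (and ``dlc=1`` on GFX10) followed by
an ``s_waitcnt vmcnt(0)`` and a cache invalidate. The function and parameter
names are illustrative only.

.. code-block:: c

  #include <stdatomic.h>

  /* Source-level acquire load; the backend lowers it using the "load atomic
     acquire" code sequences described in the table above. */
  int acquire_load(_Atomic int *flag)
  {
    return atomic_load_explicit(flag, memory_order_acquire);
  }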
13702 .. _amdgpu-amdhsa-trap-handler-abi:
Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13708 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13709 supports the ``s_trap`` instruction. For usage see:
13711 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13712 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13713 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13715 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13716 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13718 =================== =============== =============== =======================================
Usage               Code Sequence   Trap Handler    Description
                                    Inputs
13721 =================== =============== =============== =======================================
13722 reserved ``s_trap 0x00`` Reserved by hardware.
13723 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap``
13724 ``queue_ptr`` intrinsic (not implemented).
13727 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13728 ``queue_ptr`` the trap instruction. The associated
13729 queue is signalled to put it into the
13730 error state. When the queue is put in
13731 the error state, the waves executing
dispatches on the queue will be terminated.
13734 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13735 as a no-operation. The trap handler
13736 is entered and immediately returns to
13737 continue execution of the wavefront.
13738 - If the debugger is enabled, causes
13739 the debug trap to be reported by the
13740 debugger and the wavefront is put in
13741 the halt state with the PC at the
13742 instruction. The debugger must
13743 increment the PC and resume the wave.
13744 reserved ``s_trap 0x04`` Reserved.
13745 reserved ``s_trap 0x05`` Reserved.
13746 reserved ``s_trap 0x06`` Reserved.
13747 reserved ``s_trap 0x07`` Reserved.
13748 reserved ``s_trap 0x08`` Reserved.
13749 reserved ``s_trap 0xfe`` Reserved.
13750 reserved ``s_trap 0xff`` Reserved.
13751 =================== =============== =============== =======================================
13755 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13756 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13758 =================== =============== =============== =======================================
Usage               Code Sequence   Trap Handler    Description
                                    Inputs
13761 =================== =============== =============== =======================================
13762 reserved ``s_trap 0x00`` Reserved by hardware.
13763 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for
13764 breakpoints. Causes wave to be halted
13765 with the PC at the trap instruction.
13766 The debugger is responsible to resume
13767 the wave, including the instruction
13768 that the breakpoint overwrote.
13769 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13770 ``queue_ptr`` the trap instruction. The associated
13771 queue is signalled to put it into the
13772 error state. When the queue is put in
13773 the error state, the waves executing
dispatches on the queue will be terminated.
13776 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13777 as a no-operation. The trap handler
13778 is entered and immediately returns to
13779 continue execution of the wavefront.
13780 - If the debugger is enabled, causes
13781 the debug trap to be reported by the
13782 debugger and the wavefront is put in
13783 the halt state with the PC at the
13784 instruction. The debugger must
13785 increment the PC and resume the wave.
13786 reserved ``s_trap 0x04`` Reserved.
13787 reserved ``s_trap 0x05`` Reserved.
13788 reserved ``s_trap 0x06`` Reserved.
13789 reserved ``s_trap 0x07`` Reserved.
13790 reserved ``s_trap 0x08`` Reserved.
13791 reserved ``s_trap 0xfe`` Reserved.
13792 reserved ``s_trap 0xff`` Reserved.
13793 =================== =============== =============== =======================================
13797 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13798 :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13800 =================== =============== ================ ================= =======================================
13801 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13802 =================== =============== ================ ================= =======================================
13803 reserved ``s_trap 0x00`` Reserved by hardware.
13804 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for
13805 breakpoints. Causes wave to be halted
13806 with the PC at the trap instruction.
13807 The debugger is responsible to resume
13808 the wave, including the instruction
13809 that the breakpoint overwrote.
13810 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at
13811 ``queue_ptr`` the trap instruction. The associated
13812 queue is signalled to put it into the
13813 error state. When the queue is put in
13814 the error state, the waves executing
dispatches on the queue will be terminated.
13817 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If debugger not enabled then behaves
13818 as a no-operation. The trap handler
13819 is entered and immediately returns to
13820 continue execution of the wavefront.
13821 - If the debugger is enabled, causes
13822 the debug trap to be reported by the
13823 debugger and the wavefront is put in
13824 the halt state with the PC at the
13825 instruction. The debugger must
13826 increment the PC and resume the wave.
13827 reserved ``s_trap 0x04`` Reserved.
13828 reserved ``s_trap 0x05`` Reserved.
13829 reserved ``s_trap 0x06`` Reserved.
13830 reserved ``s_trap 0x07`` Reserved.
13831 reserved ``s_trap 0x08`` Reserved.
13832 reserved ``s_trap 0xfe`` Reserved.
13833 reserved ``s_trap 0xff`` Reserved.
13834 =================== =============== ================ ================= =======================================
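For reference, the hedged C sketch below shows source-level constructs that
reach these trap codes: Clang's ``__builtin_trap()`` is lowered to the
``llvm.trap`` intrinsic and ``__builtin_debugtrap()`` to ``llvm.debugtrap``,
which the tables above map to ``s_trap 0x02`` and ``s_trap 0x03`` respectively.
The function names are illustrative only.

.. code-block:: c

  /* __builtin_trap() -> llvm.trap (s_trap 0x02): the wave is halted and the
     associated queue is put into the error state. */
  void fatal_error(void)
  {
    __builtin_trap();
  }

  /* __builtin_debugtrap() -> llvm.debugtrap (s_trap 0x03): effectively a
     no-operation unless a debugger is enabled. */
  void debug_break(void)
  {
    __builtin_debugtrap();
  }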
13836 .. _amdgpu-amdhsa-function-call-convention:
Function Call Convention
~~~~~~~~~~~~~~~~~~~~~~~~

This section is currently incomplete and has inaccuracies. It is a WIP that will
13844 be updated as information is determined.
13846 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13847 addresses. Unswizzled addresses are normal linear addresses.
13849 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
Kernel Functions
++++++++++++++++

This section describes the call convention ABI for the outer kernel function.
See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.
13859 The following is not part of the AMDGPU kernel calling convention but describes
13860 how the AMDGPU implements function calls:
1. Clang decides the kernarg layout to match the *HSA Programmer's Language
   Reference*.

   - All structs are passed directly.
   - Lambda values are passed *TBA*.

   .. TODO::

      - Does this really follow HSA rules? Or are structs >16 bytes passed
        by-value?
      - What is the ABI for lambda values?

2. The kernel performs certain setup in its prolog, as described in
   :ref:`amdgpu-amdhsa-kernel-prolog`.
13877 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13879 Non-Kernel Functions
13880 ++++++++++++++++++++
13882 This section describes the call convention ABI for functions other than the
13883 outer kernel function.
13885 If a kernel has function calls then scratch is always allocated and used for
13886 the call stack which grows from low address to high address using the swizzled
13887 scratch address space.
13889 On entry to a function:
13891 1. SGPR0-3 contain a V# with the following properties (see
13892 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
* Base address pointing to the beginning of the wavefront scratch backing
  memory.
13896 * Swizzled with dword element size and stride of wavefront size elements.
2. The FLAT_SCRATCH register pair is set up. See
13899 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
13900 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
13901 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
13902 4. The EXEC register is set to the lanes active on entry to the function.
13903 5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
   below.
7. SGPR30-31 return address (RA). The code address that the function must
   return to when it completes. The value is undefined if the function is *no
   return*.
13909 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13910 offset relative to the beginning of the wavefront scratch backing memory.
13912 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
manner.
13916 The unswizzled SP value can be converted into the swizzled SP value by:
13918 | swizzled SP = unswizzled SP / wavefront size
13920 This may be used to obtain the private address space address of stack
13921 objects and to convert this address to a flat address by adding the flat
scratch aperture base address. (A hedged sketch of this conversion follows this list.)
13924 The swizzled SP value is always 4 bytes aligned for the ``r600``
13925 architecture and 16 byte aligned for the ``amdgcn`` architecture.
13929 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13930 OpenCL language which has the largest base type defined as 16 bytes.
13932 On entry, the swizzled SP value is the address of the first function
13933 argument passed on the stack. Other stack passed arguments are positive
13934 offsets from the entry swizzled SP value.
13936 The function may use positive offsets beyond the last stack passed argument
13937 for stack allocated local variables and register spill slots. If necessary,
13938 the function may align these to greater alignment than 16 bytes. After these
13939 the function may dynamically allocate space for such things as runtime sized
13940 ``alloca`` local allocations.
13942 If the function calls another function, it will place any stack allocated
13943 arguments after the last local allocation and adjust SGPR32 to the address
13944 after the last local allocation.
13946 9. All other registers are unspecified.
10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
    to the function.
11. Use pass-by-reference (byref) instead of pass-by-value (byval) for struct
    arguments in the C ABI. The callee is responsible for allocating stack memory
    and copying the value of the struct if modified. Note that the backend still
    supports byval for struct arguments.
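The following hedged C sketch illustrates the stack pointer conversion
described in the list above (the helper and its parameter names are
illustrative, not part of the ABI):

.. code-block:: c

  #include <stdint.h>

  /* Illustrative helper only: convert an unswizzled SP (a byte offset into
     the wavefront scratch backing memory) into the swizzled private address,
     and then into a flat address by adding the flat scratch aperture base. */
  static uint64_t unswizzled_sp_to_flat(uint32_t unswizzled_sp,
                                        uint32_t wavefront_size,
                                        uint64_t flat_scratch_aperture_base)
  {
    uint32_t swizzled_sp = unswizzled_sp / wavefront_size; /* private address */
    return flat_scratch_aperture_base + (uint64_t)swizzled_sp;
  }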
13954 On exit from a function:
13956 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13957 described below. Any registers used are considered clobbered registers.
13958 2. The following registers are preserved and have the same value as on entry:
13963 * All SGPR registers except the clobbered registers of SGPR4-31.
13981 Except the argument registers, the VGPRs clobbered and the preserved
13982 registers are intermixed at regular intervals in order to keep a
13983 similar ratio independent of the number of allocated VGPRs.
13985 * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13986 * Lanes of all VGPRs that are inactive at the call site.
For the AMDGPU backend, an inter-procedural register allocation (IPRA)
optimization may mark some of the clobbered SGPR and VGPR registers as
preserved if it can be determined that the called function does not change
their value.
3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.
14001 - How are function results returned? The address of structured types is passed
14002 by reference, but what about other types?
14004 The function input arguments are made up of the formal arguments explicitly
14005 declared by the source language function plus the implicit input arguments used
14006 by the implementation.
14008 The source language input arguments are:
1. Any source language implicit ``this`` or ``self`` argument comes first as a
   pointer type.
14012 2. Followed by the function formal arguments in left to right source order.
14014 The source language result arguments are:
14016 1. The function result argument.
14018 The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
14020 each field is passed as if a separate argument. For input arguments, if the
14021 called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
The source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
14028 location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.
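The following hedged C-level sketch illustrates the distinction (the type and
function names are made up; the decomposition itself is performed by the
compiler, not by the programmer):

.. code-block:: c

  /* 16 bytes or less: decomposed recursively into its fields and passed as if
     each field were a separate argument ("direct struct"). */
  typedef struct { float x, y, z, w; } small_args_t;   /* 16 bytes */

  /* Greater than 16 bytes: the caller makes a stack copy and passes its
     address; the callee dereferences it ("by-value struct"). */
  typedef struct { double d[4]; } large_args_t;        /* 32 bytes */

  float  take_small(small_args_t s);   /* roughly take_small(x, y, z, w)  */
  double take_large(large_args_t l);   /* roughly take_large(&stack_copy) */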
A source language result struct type argument that is greater than 16 bytes is
14033 returned by reference. The caller is responsible for allocating a stack location
14034 to hold the result value and passes the address as the last input argument
14035 (before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference when
14037 storing the result value. Clang terms this *structured return (sret)*.
14039 *TODO: correct the ``sret`` definition.*
Is this definition correct? Or is ``sret`` only used if passing in registers, and
the non-decomposed struct is passed as a stack argument? Or something else? Is
the memory location in the caller stack frame, or a stack memory argument so
that no address is passed because the caller can directly write to the argument
stack location? But then the stack location is still live after return. If it is
an argument stack location, is it the first stack argument or the last one?
Lambda argument types are treated as struct types with an implementation defined
set of fields.
14055 Need to specify the ABI for lambda types for AMDGPU.
For the AMDGPU backend, all source language arguments (including the decomposed
14058 struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
14059 they are passed in SGPRs.
14061 The AMDGPU backend walks the function call graph from the leaves to determine
14062 which implicit input arguments are used, propagating to each caller of the
14063 function. The used implicit arguments are appended to the function arguments
14064 after the source language arguments in the following order:
Are recursion or external functions supported?
14070 1. Work-Item ID (1 VGPR)
The X, Y and Z work-item ID are packed into a single VGPR with the following
layout. Only fields actually used by the function are set. The other bits
are undefined. (A hedged unpacking sketch follows this list of implicit
arguments.)
14076 The values come from the initial kernel execution state. See
14077 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14079 .. table:: Work-item implicit argument layout
14080 :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
14082 ======= ======= ==============
14083 Bits Size Field Name
14084 ======= ======= ==============
14085 9:0 10 bits X Work-Item ID
14086 19:10 10 bits Y Work-Item ID
14087 29:20 10 bits Z Work-Item ID
14088 31:30 2 bits Unused
14089 ======= ======= ==============
14091 2. Dispatch Ptr (2 SGPRs)
14093 The value comes from the initial kernel execution state. See
14094 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14096 3. Queue Ptr (2 SGPRs)
14098 The value comes from the initial kernel execution state. See
14099 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14101 4. Kernarg Segment Ptr (2 SGPRs)
14103 The value comes from the initial kernel execution state. See
14104 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14106 5. Dispatch id (2 SGPRs)
14108 The value comes from the initial kernel execution state. See
14109 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14111 6. Work-Group ID X (1 SGPR)
14113 The value comes from the initial kernel execution state. See
14114 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14116 7. Work-Group ID Y (1 SGPR)
14118 The value comes from the initial kernel execution state. See
14119 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14121 8. Work-Group ID Z (1 SGPR)
14123 The value comes from the initial kernel execution state. See
14124 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
14126 9. Implicit Argument Ptr (2 SGPRs)
14128 The value is computed by adding an offset to Kernarg Segment Ptr to get the
14129 global address space pointer to the first kernarg implicit argument.
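The hedged C sketch below unpacks the packed work-item ID VGPR according to
the layout in the table above (the function and variable names are
illustrative only):

.. code-block:: c

  #include <stdint.h>

  typedef struct {
    uint32_t x, y, z;
  } workitem_id_t;

  /* Extract the X, Y and Z work-item IDs from the packed 32-bit value. */
  static workitem_id_t unpack_workitem_ids(uint32_t packed)
  {
    workitem_id_t id;
    id.x = packed & 0x3ff;          /* bits  9:0  */
    id.y = (packed >> 10) & 0x3ff;  /* bits 19:10 */
    id.z = (packed >> 20) & 0x3ff;  /* bits 29:20 */
    return id;
  }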
14131 The input and result arguments are assigned in order in the following manner:
There are likely some errors and omissions in the following description that
need correction.
14140 Check the Clang source code to decipher how function arguments and return
14141 results are handled. Also see the AMDGPU specific values used.
* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
  VGPR31.
14146 If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
14152 How are overly aligned structures allocated on the stack?
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
  SGPR29.
14157 If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
14161 Note that decomposed struct type arguments may have some fields passed in
14162 registers and some in memory.
So, a struct which can pass some fields as decomposed register arguments will
pass the rest as decomposed stack elements? But an argument that will not start
in registers will not be decomposed and will be passed as a non-decomposed
stack value?
14171 The following is not part of the AMDGPU function calling convention but
14172 describes how the AMDGPU implements function calls:
14174 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
14175 unswizzled scratch address. It is only needed if runtime sized ``alloca``
14176 are used, or for the reasons defined in ``SIFrameLowering``.
14177 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
14178 to access the incoming stack arguments in the function. The BP is needed
14179 only when the function requires the runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
14183 4. No CFI is currently generated. See
14184 :ref:`amdgpu-dwarf-call-frame-information`.
14188 CFI will be generated that defines the CFA as the unswizzled address
14189 relative to the wave scratch base in the unswizzled private address space
14190 of the lowest address stack allocated local variable.
14192 ``DW_AT_frame_base`` will be defined as the swizzled address in the
14193 swizzled private address space by dividing the CFA by the wavefront size
14194 (since CFA is always at least dword aligned which matches the scratch
14195 swizzle element size).
14197 If no dynamic stack alignment was performed, the stack allocated arguments
14198 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
14199 local variables and register spill slots are accessed as positive offsets
14200 relative to ``DW_AT_frame_base``.
14202 5. Function argument passing is implemented by copying the input physical
14203 registers to virtual registers on entry. The register allocator can spill if
14204 necessary. These are copied back to physical registers at call sites. The
14205 net effect is that each function call can have these values in entirely
14206 distinct locations. The IPRA can help avoid shuffling argument registers.
14207 6. Call sites are implemented by setting up the arguments at positive offsets
14208 from SP. Then SP is incremented to account for the known frame size before
14209 the call and decremented after the call.
The CFI will reflect the changed calculation needed to compute the CFA
from SP.
7. 4-byte spill slots are used in the stack frame. One slot is allocated for an
14217 emergency spill slot. Buffer instructions are used for stack accesses and
14218 not the ``flat_scratch`` instruction.
14222 Explain when the emergency spill slot is used.
14226 Possible broken issues:
14228 - Stack arguments must be aligned to required alignment.
14229 - Stack is aligned to max(16, max formal argument alignment)
14230 - Direct argument < 64 bits should check register budget.
14231 - Register budget calculation should respect ``inreg`` for SGPR.
14232 - SGPR overflow is not handled.
14233 - struct with 1 member unpeeling is not checking size of member.
14234 - ``sret`` is after ``this`` pointer.
14235 - Caller is not implementing stack realignment: need an extra pointer.
14236 - Should say AMDGPU passes FP rather than SP.
- Should CFI define CFA as the address of locals or arguments? The difference is
  apparent once dynamic alignment has been implemented.
14239 - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
14240 highest address of stack frame and use negative offset for locals. Would
14241 allow SP to be the same as FP and could support signal-handler-like as now
14242 have a real SP for the top of the stack.
- How is ``sret`` passed on the stack? In the argument stack area? Can it overlay
  arguments?
AMDPAL
------

This section provides code conventions used when the target triple OS is
14250 ``amdpal`` (see :ref:`amdgpu-target-triples`).
14252 .. _amdgpu-amdpal-code-object-metadata-section:
14254 Code Object Metadata
14255 ~~~~~~~~~~~~~~~~~~~~
14259 The metadata is currently in development and is subject to major
14260 changes. Only the current version is supported. *When this document
14261 was generated the version was 2.6.*
14263 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
14264 record (see :ref:`amdgpu-note-records-v3-onwards`).
14266 The metadata is represented as Message Pack formatted binary data (see
14267 [MsgPack]_). The top level is a Message Pack map that includes the keys
14268 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
14269 and referenced tables.
14271 Additional information can be added to the maps. To avoid conflicts, any
14272 key names should be prefixed by "*vendor-name*." where ``vendor-name``
14273 can be the name of the vendor and specific vendor tool that generates the
14274 information. The prefix is abbreviated to simply "." when it appears
14275 within a map that has been added by the same *vendor-name*.
14277 .. table:: AMDPAL Code Object Metadata Map
14278 :name: amdgpu-amdpal-code-object-metadata-map-table
14280 =================== ============== ========= ======================================================================
14281 String Key Value Type Required? Description
14282 =================== ============== ========= ======================================================================
14283 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values
14284 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
14285 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See
14286 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
14287 definition of the keys included in that map.
14288 =================== ============== ========= ======================================================================
14292 .. table:: AMDPAL Code Object Pipeline Metadata Map
14293 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
14295 ====================================== ============== ========= ===================================================
14296 String Key Value Type Required? Description
14297 ====================================== ============== ========= ===================================================
14298 ".name" string Source name of the pipeline.
14299 ".type" string Pipeline type, e.g. VsPs. Values include:
14309 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
14310 2 integers 64 bits is the "stable" portion of the hash, used
14311 for e.g. shader replacement lookup. Upper 64 bits
14312 is the "unique" portion of the hash, used for
14313 e.g. pipeline cache lookup. The value is
14314 implementation defined, and can not be relied on
14315 between different builds of the compiler.
14316 ".shaders" map Per-API shader metadata. See
14317 :ref:`amdgpu-amdpal-code-object-shader-map-table`
for the definition of the keys included in that map.
14320 ".hardware_stages" map Per-hardware stage metadata. See
14321 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
for the definition of the keys included in that map.
14324 ".shader_functions" map Per-shader function metadata. See
14325 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
for the definition of the keys included in that map.
14328 ".registers" map Required Hardware register configuration. See
14329 :ref:`amdgpu-amdpal-code-object-register-map-table`
for the definition of the keys included in that map.
14332 ".user_data_limit" integer Number of user data entries accessed by this
14334 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
14335 NoUserDataSpilling.
14336 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
14337 viewport array index feature. Pipelines which use
14338 this feature can render into all 16 viewports,
14339 whereas pipelines which do not use it are
14340 restricted to viewport #0.
14341 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
14342 handling data-passing between the ES and GS
14343 shader stages. This can be zero if the data is
14344 passed using off-chip buffers. This value should
14345 be used to program all user-SGPRs which have been
14346 marked with "UserDataMapping::EsGsLdsSize"
14347 (typically only the GS and VS HW stages will ever
14348 have a user-SGPR so marked).
14349 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
14350 (maximum number of threads in a subgroup).
14351 ".num_interpolants" integer Graphics only. Number of PS interpolants.
14352 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
14353 ".api" string Name of the client graphics API.
14354 ".api_create_info" binary Graphics API shader create info binary blob. Can
14355 be defined by the driver using the compiler if
14356 they want to be able to correlate API-specific
14357 information used during creation at a later time.
14358 ====================================== ============== ========= ===================================================
14362 .. table:: AMDPAL Code Object Shader Map
14363 :name: amdgpu-amdpal-code-object-shader-map-table
14366 +-------------+--------------+-------------------------------------------------------------------+
14367 |String Key |Value Type |Description |
14368 +=============+==============+===================================================================+
14369 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
14370 |- ".vertex" | |for the definition of the keys included in that map. |
14373 |- ".geometry"| | |
14375 +-------------+--------------+-------------------------------------------------------------------+
14379 .. table:: AMDPAL Code Object API Shader Metadata Map
14380 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
14382 ==================== ============== ========= =====================================================================
14383 String Key Value Type Required? Description
14384 ==================== ============== ========= =====================================================================
14385 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
14386 2 integers is implementation defined, and can not be relied on between
14387 different builds of the compiler.
14388 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
14399 ==================== ============== ========= =====================================================================
14403 .. table:: AMDPAL Code Object Hardware Stage Map
14404 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14406 +-------------+--------------+-----------------------------------------------------------------------+
14407 |String Key |Value Type |Description |
14408 +=============+==============+=======================================================================+
14409 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14410 |- ".hs" | |for the definition of the keys included in that map. |
14416 +-------------+--------------+-----------------------------------------------------------------------+
14420 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14421 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14423 ========================== ============== ========= ===============================================================
14424 String Key Value Type Required? Description
14425 ========================== ============== ========= ===============================================================
14426 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
14427 ".scratch_memory_size" integer Scratch memory size in bytes.
14428 ".lds_size" integer Local Data Share size in bytes.
14429 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
14430 ".vgpr_count" integer Number of VGPRs used.
14431 ".agpr_count" integer Number of AGPRs used.
14432 ".sgpr_count" integer Number of SGPRs used.
14433 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
14434 directive to instruct the compiler to limit the VGPR usage to
14435 be less than or equal to the specified value (only set if
14436 different from HW default).
14437 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW
14439 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
14441 ".wavefront_size" integer Wavefront size (only set if different from HW default).
14442 ".uses_uavs" boolean The shader reads or writes UAVs.
14443 ".uses_rovs" boolean The shader reads or writes ROVs.
14444 ".writes_uavs" boolean The shader writes to one or more UAVs.
14445 ".writes_depth" boolean The shader writes out a depth value.
14446 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
14448 ".uses_prim_id" boolean The shader uses PrimID.
14449 ========================== ============== ========= ===============================================================
14453 .. table:: AMDPAL Code Object Shader Function Map
14454 :name: amdgpu-amdpal-code-object-shader-function-map-table
14456 =============== ============== ====================================================================
14457 String Key Value Type Description
14458 =============== ============== ====================================================================
14459 *symbol name* map *symbol name* is the ELF symbol name of the shader function code
14460 entry address. The value is the function's metadata. See
14461 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14462 =============== ============== ====================================================================
14466 .. table:: AMDPAL Code Object Shader Function Metadata Map
14467 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14469 ============================= ============== =================================================================
14470 String Key Value Type Description
14471 ============================= ============== =================================================================
14472 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
14473 2 integers is implementation defined, and can not be relied on between
14474 different builds of the compiler.
14475 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
14476 ".lds_size" integer Size in bytes of LDS memory.
14477 ".vgpr_count" integer Number of VGPRs used by the shader.
14478 ".sgpr_count" integer Number of SGPRs used by the shader.
14479 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
14480 ".shader_subtype" string Shader subtype/kind. Values include:
14484 ============================= ============== =================================================================
14488 .. table:: AMDPAL Code Object Register Map
14489 :name: amdgpu-amdpal-code-object-register-map-table
14491 ========================== ============== ====================================================================
14492 32-bit Integer Key Value Type Description
14493 ========================== ============== ====================================================================
14494 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14495 a GRBM register (i.e., driver accessible GPU register number, not
14496 shader GPR register number). The driver is required to program each
14497 specified register to the corresponding specified value when
14498 executing this pipeline. Typically, the ``reg offsets`` are the
14499 ``uint16_t`` offsets to each register as defined by the hardware
14500 chip headers. The register is set to the provided value. However, a
14501 ``reg offset`` that specifies a user data register (e.g.,
14502 COMPUTE_USER_DATA_0) needs special treatment. See
14503 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14505 ========================== ============== ====================================================================
14507 .. _amdgpu-amdpal-code-object-user-data-section:
User Data
+++++++++

Each hardware stage has a set of 32-bit physical SPI *user data registers*
14513 (either 16 or 32 based on graphics IP and the stage) which can be
14514 written from a command buffer and then loaded into SGPRs when waves are
14515 launched via a subsequent dispatch or draw operation. This is the way
14516 most arguments are passed from the application/runtime to a hardware
14519 PAL abstracts this functionality by exposing a set of 128 *user data
14520 entries* per pipeline a client can use to pass arguments from a command
14521 buffer to one or more shaders in that pipeline. The ELF code object must
14522 specify a mapping from virtualized *user data entries* to physical *user
14523 data registers*, and PAL is responsible for implementing that mapping,
14524 including spilling overflow *user data entries* to memory if needed.
14526 Since the *user data registers* are GRBM-accessible SPI registers, this
14527 mapping is actually embedded in the ``.registers`` metadata entry. For
14528 most registers, the value in that map is a literal 32-bit value that
14529 should be written to the register by the driver. However, when the
14530 register is a *user data register* (any USER_DATA register e.g.,
14531 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14532 the driver to write either a *user data entry* value or one of several
14533 driver-internal values to the register. This encoding is described in
14534 the following table:
14538 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14539 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14540 always be programmed to the address of the GlobalTable, and *user data
14541 register* 1 must always be programmed to the address of the PerShaderTable.
14545 .. table:: AMDPAL User Data Mapping
14546 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14548 ========== ================= ===============================================================================
14549 Value Name Description
14550 ========== ================= ===============================================================================
14551 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14552 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
14553 always point to *user data register* 0).
14554 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
14555 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14556 for more detail (should always point to *user data register* 1).
14557 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
14558 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14560 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14561 reference the draw index in the vertex shader. Only supported by the first
14562 stage in a graphics pipeline.
14563 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
14564 a graphics pipeline.
0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a graphics pipeline.
14567 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14568 a buffer containing the grid dimensions for a Compute dispatch operation. The
14569 high half of the address is stored in the next sequential user-SGPR. Only
14570 supported by compute pipelines.
14571 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
space used for the ES/GS pseudo-ring-buffer for passing data between shader stages.
14574 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
14575 pipeline instancing.
14576 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
14577 can only appear for one shader stage per pipeline.
14578 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
14579 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
14580 only appear for one shader stage per pipeline.
14581 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
14582 only appear for one shader stage per pipeline (PS). These replace color targets
14583 and are completely separate from any UAVs used by the shader. This is optional,
14584 and only used by the PS when UAV exports are used to replace color-target
14585 exports to optimize specific shaders.
14586 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
14587 some NGG pipelines to perform culling. This value contains the address of the
14588 first of two consecutive registers which provide the full GPU address.
14589 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
14590 ========== ================= ===============================================================================
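The hedged C sketch below shows one way a tool might interpret a *user data
register* value from the ``.registers`` map according to the encoding above
(the function name is illustrative and this is not PAL source code):

.. code-block:: c

  #include <stdint.h>
  #include <stdio.h>

  /* Decode a user data register value per the AMDPAL User Data Mapping table. */
  static void describe_user_data_mapping(uint32_t value)
  {
    if (value <= 127)
      printf("user_data_entry[%u]\n", (unsigned)value); /* virtualized entry */
    else if (value == 0x10000000u)
      printf("GlobalTable pointer\n");
    else if (value == 0x10000001u)
      printf("PerShaderTable pointer\n");
    else if (value == 0x10000002u)
      printf("SpillTable pointer\n");
    else
      printf("other driver-internal value 0x%x\n", (unsigned)value);
  }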
14592 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14597 Low 32 bits of the GPU address for an optional buffer in the ``.data``
14598 section of the ELF. The high 32 bits of the address match the high 32 bits
14599 of the shader's program counter.
14601 The buffer can be anything the shader compiler needs it for, and
14602 allows each shader to have its own region of the ``.data`` section.
14603 Typically, this could be a table of buffer SRD's and the data pointed to
14604 by the buffer SRD's, but it could be a flat-address region of memory as
14605 well. Its layout and usage are defined by the shader compiler.
14607 Each shader's table in the ``.data`` section is referenced by the symbol
14608 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
14609 hardware shader stage the data is for. E.g.,
14610 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
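
Below is a hypothetical sketch (not part of the PAL specification) of how a
compute shader might form the full 64-bit address of this buffer. It assumes
the low 32 bits have been loaded into *user data register* 1, mapped here to
``s1``, as described for *PerShaderTable* above:

.. code-block:: nasm

  // Hypothetical sketch: combine the low 32 bits of the per-shader table
  // address (from the user data register) with the high 32 bits of the
  // shader's program counter to obtain the full 64-bit address.
  s_getpc_b64    s[2:3]               // s3 = high 32 bits of the program counter
  s_mov_b32      s2, s1               // s2 = low 32 bits from user data register 1
  s_load_dwordx4 s[4:7], s[2:3], 0x0  // e.g. read the first SRD from the table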
14612 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14617 It is possible for a hardware shader to need access to more *user data
14618 entries* than there are slots available in user data registers for one
14619 or more hardware shader stages. In that case, the PAL runtime expects
14620 the necessary *user data entries* to be spilled to GPU memory, with one
14621 user data register used to point to the spilled user data memory. The
14622 value of the *user data entry* must then represent the location where
14623 a shader expects to read the low 32-bits of the table's GPU virtual
14624 address. The *spill table* itself represents a set of 32-bit values
14625 managed by the PAL runtime in GPU-accessible memory that can be made
14626 indirectly accessible to a hardware shader.
14631 This section provides code conventions used when the target triple OS is
14632 empty (see :ref:`amdgpu-target-triples`).
14637 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime does
14638 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
14639 instructions are handled as follows:
14641 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14642 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14644 =============== =============== ===========================================
14645 Usage Code Sequence Description
14646 =============== =============== ===========================================
14647 llvm.trap s_endpgm Causes wavefront to be terminated.
14648 llvm.debugtrap *none* Compiler warning given that there is no
14649 trap handler installed.
14650 =============== =============== ===========================================
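
For example, a kernel containing ``llvm.trap`` compiled for a non-amdhsa OS
simply terminates the wavefront at that point; a minimal sketch of the emitted
code:

.. code-block:: nasm

  // With no trap handler installed, llvm.trap lowers to wavefront termination.
  s_endpgm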
14660 When the language is OpenCL the following differences occur:
14662 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14663 2. The AMDGPU backend appends additional arguments to the kernel's explicit
14664 arguments for the AMDHSA OS (see
14665 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
14666 3. Additional metadata is generated
14667 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14669 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14670 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14672 ======== ==== ========= ===========================================
14673 Position Byte Byte      Description
14674          Size Alignment
14675 ======== ==== ========= ===========================================
14676 1 8 8 OpenCL Global Offset X
14677 2 8 8 OpenCL Global Offset Y
14678 3 8 8 OpenCL Global Offset Z
14679 4 8 8 OpenCL address of printf buffer
14680 5 8 8 OpenCL address of virtual queue used by enqueue_kernel.
14682 6 8 8 OpenCL address of AqlWrap struct used by enqueue_kernel.
14684 7 8 8 Pointer argument used for Multi-grid synchronization.
14686 ======== ==== ========= ===========================================
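
As a purely illustrative sketch (the offsets depend on the kernel's explicit
arguments), a kernel whose explicit arguments occupy the first 16 bytes of the
kernarg segment would find *OpenCL Global Offset X* at byte offset 16. Assuming
the kernarg segment pointer has been preloaded into ``s[4:5]``:

.. code-block:: nasm

  // Hypothetical sketch: load the 8-byte OpenCL Global Offset X implicit
  // argument appended after 16 bytes of explicit arguments.
  s_load_dwordx2 s[0:1], s[4:5], 0x10
  s_waitcnt lgkmcnt(0)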
14693 When the language is HCC the following differences occur:
14695 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14697 .. _amdgpu-assembler:
14702 The AMDGPU backend has an LLVM-MC based assembler that is currently in development.
14703 It supports AMDGCN GFX6-GFX11.
14705 This section describes the general syntax for instructions and operands.
14710 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14712 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14713 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14715 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14716 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14718 The order of operands and modifiers is fixed.
14719 Most modifiers are optional and may be omitted.
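
For example, in the following instruction (taken from the examples below) the
opcode is followed by four comma-separated operands and three space-separated
modifiers:

.. code-block:: nasm

  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc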
14721 Links to detailed instruction syntax description may be found in the following
14722 table. Note that features under development are not included
14723 in this description.
14725 ============= ============================================= =======================================
14726 Architecture Core ISA ISA Variants and Extensions
14727 ============= ============================================= =======================================
14728 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \-
14729 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \-
14730 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14732 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14734 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14736 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14738 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14740 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14742 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14744 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14746 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14748 :doc:`gfx941<AMDGPU/AMDGPUAsmGFX940>`
14750 :doc:`gfx942<AMDGPU/AMDGPUAsmGFX940>`
14752 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14754 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14756 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14758 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14760 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14762 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14764 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14766 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14768 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14770 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14772 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14774 RDNA 3 :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>` :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
14776 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
14778 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
14780 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
14781 ============= ============================================= =======================================
14783 For more information about instructions, their semantics, and supported
14784 combinations of operands, refer to one of the instruction set architecture manuals
14785 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14786 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14787 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_,
14788 [AMD-GCN-GFX10-RDNA2]_ and [AMD-GCN-GFX11-RDNA3]_.
14793 A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14798 A detailed description of modifiers may be found
14799 :doc:`here<AMDGPUModifierSyntax>`.
14801 Instruction Examples
14802 ~~~~~~~~~~~~~~~~~~~~
14807 .. code-block:: nasm
14809 ds_add_u32 v2, v4 offset:16
14810 ds_write_src2_b64 v2 offset0:4 offset1:8
14811 ds_cmpst_f32 v2, v4, v6
14812 ds_min_rtn_f64 v[8:9], v2, v[4:5]
14814 For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.
14820 .. code-block:: nasm
14822 flat_load_dword v1, v[3:4]
14823 flat_store_dwordx3 v[3:4], v[5:7]
14824 flat_atomic_swap v1, v[3:4], v5 glc
14825 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14826 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14828 For a full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.
14834 .. code-block:: nasm
14836 buffer_load_dword v1, off, s[4:7], s1
14837 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14838 buffer_store_format_xy v[1:2], off, s[4:7], s1
14840 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14842 For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.
14848 .. code-block:: nasm
14850 s_load_dword s1, s[2:3], 0xfc
14851 s_load_dwordx8 s[8:15], s[2:3], s4
14852 s_load_dwordx16 s[88:103], s[2:3], s4
14856 For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.
14862 .. code-block:: nasm
14865 s_mov_b64 s[0:1], 0x80000000
14867 s_wqm_b64 s[2:3], s[4:5]
14868 s_bcnt0_i32_b64 s1, s[2:3]
14869 s_swappc_b64 s[2:3], s[4:5]
14870 s_cbranch_join s[4:5]
14872 For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.
14878 .. code-block:: nasm
14880 s_add_u32 s1, s2, s3
14881 s_and_b64 s[2:3], s[4:5], s[6:7]
14882 s_cselect_b32 s1, s2, s3
14883 s_andn2_b32 s2, s4, s6
14884 s_lshr_b64 s[2:3], s[4:5], s6
14885 s_ashr_i32 s2, s4, s6
14886 s_bfm_b64 s[2:3], s4, s6
14887 s_bfe_i64 s[2:3], s[4:5], s6
14888 s_cbranch_g_fork s[4:5], s[6:7]
14890 For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.
14896 .. code-block:: nasm
14898 s_cmp_eq_i32 s1, s2
14899 s_bitcmp1_b32 s1, s2
14900 s_bitcmp0_b64 s[2:3], s4
14903 For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.
14909 .. code-block:: nasm
14914 s_waitcnt 0 ; Wait for all counters to be 0
14915 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
14916 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
14920 s_sendmsg sendmsg(MSG_INTERRUPT)
14923 For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.
14926 Unless otherwise mentioned, little verification is performed on the operands
14927 of SOPP instructions, so it is up to the programmer to be familiar with the
14928 range of acceptable values.
14933 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
14934 the assembler will automatically use the optimal encoding based on the operands. To
14935 force a specific encoding, one can add a suffix to the opcode of the instruction:
14937 * _e32 for 32-bit VOP1/VOP2/VOPC
14938 * _e64 for 64-bit VOP3
14939 * _dpp for VOP_DPP
14940 * _e64_dpp for VOP3 with DPP
14941 * _sdwa for VOP_SDWA
14943 VOP1/VOP2/VOP3/VOPC examples:
14945 .. code-block:: nasm
14948 v_mov_b32_e32 v1, v2
14950 v_cvt_f64_i32_e32 v[1:2], v2
14951 v_floor_f32_e32 v1, v2
14952 v_bfrev_b32_e32 v1, v2
14953 v_add_f32_e32 v1, v2, v3
14954 v_mul_i32_i24_e64 v1, v2, 3
14955 v_mul_i32_i24_e32 v1, -3, v3
14956 v_mul_i32_i24_e32 v1, -100, v3
14957 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
14958 v_max_f16_e32 v1, v2, v3
14962 .. code-block:: nasm
14964 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
14965 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14966 v_mov_b32 v0, v0 wave_shl:1
14967 v_mov_b32 v0, v0 row_mirror
14968 v_mov_b32 v0, v0 row_bcast:31
14969 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
14970 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14971 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14974 VOP3_DPP examples (Available on GFX11+):
14976 .. code-block:: nasm
14978 v_add_f32_e64_dpp v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
14979 v_sqrt_f32_e64_dpp v0, v1 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14980 v_ldexp_f32 v0, v1, v2 dpp8:[0,1,2,3,4,5,6,7]
14984 .. code-block:: nasm
14986 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14987 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14988 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14989 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14990 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14992 For a full list of supported instructions, refer to "Vector ALU instructions".
14994 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14996 Code Object V2 Predefined Symbols
14997 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15000 Code object V2 generation is no longer supported by this version of LLVM.
15002 The AMDGPU assembler defines and updates some symbols automatically. These
15003 symbols do not affect code generation.
15005 .option.machine_version_major
15006 +++++++++++++++++++++++++++++
15008 Set to the GFX major generation number of the target being assembled for. For
15009 example, when assembling for a "GFX9" target this will be set to the integer
15010 value "9". The possible GFX major generation numbers are presented in
15011 :ref:`amdgpu-processors`.
15013 .option.machine_version_minor
15014 +++++++++++++++++++++++++++++
15016 Set to the GFX minor generation number of the target being assembled for. For
15017 example, when assembling for a "GFX810" target this will be set to the integer
15018 value "1". The possible GFX minor generation numbers are presented in
15019 :ref:`amdgpu-processors`.
15021 .option.machine_version_stepping
15022 ++++++++++++++++++++++++++++++++
15024 Set to the GFX stepping generation number of the target being assembled for.
15025 For example, when assembling for a "GFX704" target this will be set to the
15026 integer value "4". The possible GFX stepping generation numbers are presented
15027 in :ref:`amdgpu-processors`.
15032 Set to zero each time a
15033 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15034 encountered. At each instruction, if the current value of this symbol is less
15035 than or equal to the maximum VGPR number explicitly referenced within that
15036 instruction then the symbol value is updated to equal that VGPR number plus one.
15042 Set to zero each time a
15043 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
15044 encountered. At each instruction, if the current value of this symbol is less
15045 than or equal to the maximum SGPR number explicitly referenced within that
15046 instruction then the symbol value is updated to equal that SGPR number plus one.
15049 .. _amdgpu-amdhsa-assembler-directives-v2:
15051 Code Object V2 Directives
15052 ~~~~~~~~~~~~~~~~~~~~~~~~~
15055 Code object V2 generation is no longer supported by this version of LLVM.
15057 The AMDGPU ABI defines auxiliary data in the output code object. In assembly
15058 source, one can specify it with assembler directives.
15060 .hsa_code_object_version major, minor
15061 +++++++++++++++++++++++++++++++++++++
15063 *major* and *minor* are integers that specify the version of the HSA code
15064 object that will be generated by the assembler.
15066 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
15067 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15070 *major*, *minor*, and *stepping* are all integers that describe the instruction
15071 set architecture (ISA) version of the assembly program.
15073 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
15074 "AMD" and *arch* should always be equal to "AMDGPU".
15076 By default, the assembler will derive the ISA version, *vendor*, and *arch*
15077 from the value of the -mcpu option that is passed to the assembler.
15079 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
15081 .amdgpu_hsa_kernel (name)
15082 +++++++++++++++++++++++++
15084 This directive specifies that the symbol with the given name is a kernel entry
15085 point (label) and that the object should contain a corresponding symbol of type
15086 STT_AMDGPU_HSA_KERNEL.
15091 This directive marks the beginning of a list of key / value pairs that are used
15092 to specify the amd_kernel_code_t object that will be emitted by the assembler.
15093 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
15094 amd_kernel_code_t values that are unspecified a default value will be used. The
15095 default value for all keys is 0, with the following exceptions:
15097 - *amd_code_version_major* defaults to 1.
15098 - *amd_kernel_code_version_minor* defaults to 2.
15099 - *amd_machine_kind* defaults to 1.
15100 - *amd_machine_version_major*, *amd_machine_version_minor*, and
15101 *amd_machine_version_stepping* are derived from the value of the -mcpu option
15102 that is passed to the assembler.
15103 - *kernel_code_entry_byte_offset* defaults to 256.
15104 - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
15105 it defaults to 6 if the target feature ``wavefrontsize64`` is enabled, otherwise 5.
15106 Note that wavefront size is specified as a power of two, so a value of **n**
15107 means a size of 2^ **n**.
15108 - *call_convention* defaults to -1.
15109 - *kernarg_segment_alignment*, *group_segment_alignment*, and
15110 *private_segment_alignment* default to 4. Note that alignments are specified
15111 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
15112 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for GFX90A onwards.
15114 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for GFX10 onwards.
15116 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
15118 The *.amd_kernel_code_t* directive must be placed immediately after the
15119 function label and before any instructions.
15121 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the
15122 comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and test/CodeGen/AMDGPU/hsa.s.
15124 .. _amdgpu-amdhsa-assembler-example-v2:
15126 Code Object V2 Example Source Code
15127 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15130 Code object V2 generation is no longer supported by this version of LLVM.
15132 Here is an example of a minimal assembly source file, defining one HSA kernel:
15137 .hsa_code_object_version 1,0
15138 .hsa_code_object_isa
15143 .amdgpu_hsa_kernel hello_world
15148 enable_sgpr_kernarg_segment_ptr = 1
15150 compute_pgm_rsrc1_vgprs = 0
15151 compute_pgm_rsrc1_sgprs = 0
15152 compute_pgm_rsrc2_user_sgpr = 2
15153 compute_pgm_rsrc1_wgp_mode = 0
15154 compute_pgm_rsrc1_mem_ordered = 0
15155 compute_pgm_rsrc1_fwd_progress = 1
15156 .end_amd_kernel_code_t
15158 s_load_dwordx2 s[0:1], s[0:1] 0x0
15159 v_mov_b32 v0, 3.14159
15160 s_waitcnt lgkmcnt(0)
15163 flat_store_dword v[1:2], v0
15166 .size hello_world, .Lfunc_end0-hello_world
15168 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
15170 Code Object V3 and Above Predefined Symbols
15171 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15173 The AMDGPU assembler defines and updates some symbols automatically. These
15174 symbols do not affect code generation.
15176 .amdgcn.gfx_generation_number
15177 +++++++++++++++++++++++++++++
15179 Set to the GFX major generation number of the target being assembled for. For
15180 example, when assembling for a "GFX9" target this will be set to the integer
15181 value "9". The possible GFX major generation numbers are presented in
15182 :ref:`amdgpu-processors`.
15184 .amdgcn.gfx_generation_minor
15185 ++++++++++++++++++++++++++++
15187 Set to the GFX minor generation number of the target being assembled for. For
15188 example, when assembling for a "GFX810" target this will be set to the integer
15189 value "1". The possible GFX minor generation numbers are presented in
15190 :ref:`amdgpu-processors`.
15192 .amdgcn.gfx_generation_stepping
15193 +++++++++++++++++++++++++++++++
15195 Set to the GFX stepping generation number of the target being assembled for.
15196 For example, when assembling for a "GFX704" target this will be set to the
15197 integer value "4". The possible GFX stepping generation numbers are presented
15198 in :ref:`amdgpu-processors`.
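
These are ordinary absolute assembler symbols, so they can be used, for
example, in conditional assembly. A minimal sketch, assuming the standard
``.if``/``.endif`` directives:

.. code-block:: nasm

  // Guard generation-specific code at assembly time.
  .if .amdgcn.gfx_generation_number == 9
    s_nop 0    // GFX9-specific path
  .endif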
15200 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
15202 .amdgcn.next_free_vgpr
15203 ++++++++++++++++++++++
15205 Set to zero before assembly begins. At each instruction, if the current value
15206 of this symbol is less than or equal to the maximum VGPR number explicitly
15207 referenced within that instruction then the symbol value is updated to equal
15208 that VGPR number plus one.
15210 May be used to set the `.amdhsa_next_free_vgpr` directive in
15211 :ref:`amdhsa-kernel-directives-table`.
15213 May be set at any time, e.g. manually set to zero at the start of each kernel.
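
For example, a minimal sketch (a complete example appears in
:ref:`amdgpu-amdhsa-assembler-example-v3-onwards`):

.. code-block:: nasm

  .set .amdgcn.next_free_vgpr, 0   // restart tracking for the next kernel
  v_mov_b32 v7, 0                  // raises the symbol value to 8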
15215 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
15217 .amdgcn.next_free_sgpr
15218 ++++++++++++++++++++++
15220 Set to zero before assembly begins. At each instruction, if the current value
15221 of this symbol is less than or equal to the maximum SGPR number explicitly
15222 referenced within that instruction then the symbol value is updated to equal
15223 that SGPR number plus one.
15225 May be used to set the `.amdhsa_next_free_sgpr` directive in
15226 :ref:`amdhsa-kernel-directives-table`.
15228 May be set at any time, e.g. manually set to zero at the start of each kernel.
15230 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
15232 Code Object V3 and Above Directives
15233 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15235 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
15236 architecture processors, and are not OS-specific. Directives which begin with
15237 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
15238 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
15239 :ref:`amdgpu-processors`.
15241 .. _amdgpu-assembler-directive-amdgcn-target:
15243 .amdgcn_target <target-triple> "-" <target-id>
15244 ++++++++++++++++++++++++++++++++++++++++++++++
15246 Optional directive which declares the ``<target-triple>-<target-id>`` supported
15247 by the containing assembler source file. Used by the assembler to validate
15248 command-line options such as ``-triple``, ``-mcpu``, and
15249 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
15250 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
15254 The target ID syntax used for code object V2 to V3 for this directive differs
15255 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
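
For example, a minimal sketch using the canonical target ID syntax (the exact
set of accepted targets and features depends on the LLVM version):

.. code-block:: nasm

  .amdgcn_target "amdgcn-amd-amdhsa--gfx90a:xnack+"

The legacy code object V2 to V3 form of the target ID appears in the example in
:ref:`amdgpu-amdhsa-assembler-example-v3-onwards`.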
15257 .amdhsa_kernel <name>
15258 +++++++++++++++++++++
15260 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
15261 ``<name>.kd``, in the current location of the current section. Only valid when
15262 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
15263 instruction to execute, and does not need to be previously defined.
15265 Marks the beginning of a list of directives used to generate the bytes of a
15266 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
15267 Directives which may appear in this list are described in
15268 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
15269 be valid for the target being assembled for, and cannot be repeated. Directives
15270 support the range of values specified by the field they reference in
15271 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
15272 assumed to have its default value, unless it is marked as "Required", in which
15273 case it is an error to omit the directive. This list of directives is
15274 terminated by an ``.end_amdhsa_kernel`` directive.
15276 .. table:: AMDHSA Kernel Assembler Directives
15277 :name: amdhsa-kernel-directives-table
15279 ======================================================== =================== ============ ===================
15280 Directive Default Supported On Description
15281 ======================================================== =================== ============ ===================
15282 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX11 Controls GROUP_SEGMENT_FIXED_SIZE in
15283 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15284 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX11 Controls PRIVATE_SEGMENT_FIXED_SIZE in
15285 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15286 ``.amdhsa_kernarg_size`` 0 GFX6-GFX11 Controls KERNARG_SIZE in
15287 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15288 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX11 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
15289 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`
15290 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
15291 (except GFX940) :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15293 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_PTR in
15294 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15295 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_QUEUE_PTR in
15296 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15297 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
15298 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15299 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_ID in
15300 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15301 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
15302 (except GFX940) :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15304 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX11 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
15305 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15306 ``.amdhsa_wavefront_size32`` Target GFX10-GFX11 Controls ENABLE_WAVEFRONT_SIZE32 in
15307 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15310 ``.amdhsa_uses_dynamic_stack`` 0 GFX6-GFX11 Controls USES_DYNAMIC_STACK in
15311 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15312 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
15313 (except GFX940) :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15315 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in
15316 GFX11 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15317 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_X in
15318 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15319 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
15320 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15321 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
15322 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15323 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_INFO in
15324 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15325 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX11 Controls ENABLE_VGPR_WORKITEM_ID in
15326 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15327 Possible values are defined in
15328 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
15329 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX11 Maximum VGPR number explicitly referenced, plus one.
15330 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
15331 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15332 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX11 Maximum SGPR number explicitly referenced, plus one.
15333 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15334 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15335 ``.amdhsa_accum_offset`` Required GFX90A, Offset of a first AccVGPR in the unified register file.
15336 GFX940 Used to calculate ACCUM_OFFSET in
15337 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15338 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX11 Whether the kernel may use the special VCC SGPR.
15339 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15340 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15341 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
15342 (except scratch memory. Used to calculate
15343 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in
15344 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15345 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
15346 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
15347 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15349 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_32 in
15350 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15351 Possible values are defined in
15352 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15353 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_16_64 in
15354 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15355 Possible values are defined in
15356 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15357 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX11 Controls FLOAT_DENORM_MODE_32 in
15358 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15359 Possible values are defined in
15360 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15361 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX11 Controls FLOAT_DENORM_MODE_16_64 in
15362 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15363 Possible values are defined in
15364 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15365 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in
15366 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15367 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in
15368 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15369 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX11 Controls FP16_OVFL in
15370 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15371 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in
15372 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15375 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX11 Controls ENABLE_WGP_MODE in
15376 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15379 ``.amdhsa_memory_ordered`` 1 GFX10-GFX11 Controls MEM_ORDERED in
15380 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15381 ``.amdhsa_forward_progress`` 0 GFX10-GFX11 Controls FWD_PROGRESS in
15382 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15383 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in
15384 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
15385 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
15386 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15387 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
15388 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15389 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
15390 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15391 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
15392 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15393 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
15394 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15395 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
15396 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15397 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
15398 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15399 ``.amdhsa_user_sgpr_kernarg_preload_length`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_LENGTH in
15400 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15401 ``.amdhsa_user_sgpr_kernarg_preload_offset`` 0 GFX90A, Controls KERNARG_PRELOAD_SPEC_OFFSET in
15402 GFX940 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15403 ======================================================== =================== ============ ===================
15408 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15409 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15411 The contents must be in the [YAML]_ markup format, with the same structure and
15412 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15413 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15414 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15416 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
15418 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15420 Code Object V3 and Above Example Source Code
15421 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15423 Here is an example of a minimal assembly source file, defining one HSA kernel:
15428 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15433 .type hello_world,@function
15435 s_load_dwordx2 s[0:1], s[0:1] 0x0
15436 v_mov_b32 v0, 3.14159
15437 s_waitcnt lgkmcnt(0)
15440 flat_store_dword v[1:2], v0
15443 .size hello_world, .Lfunc_end0-hello_world
15447 .amdhsa_kernel hello_world
15448 .amdhsa_user_sgpr_kernarg_segment_ptr 1
15449 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15450 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15459 - .name: hello_world
15460 .symbol: hello_world.kd
15461 .kernarg_segment_size: 48
15462 .group_segment_fixed_size: 0
15463 .private_segment_fixed_size: 0
15464 .kernarg_segment_align: 4
15465 .wavefront_size: 64
15468 .max_flat_workgroup_size: 256
15472 .value_kind: global_buffer
15473 .address_space: global
15474 .actual_access: write_only
15476 .end_amdgpu_metadata
15478 This kernel is equivalent to the following HIP program:
15483 __global__ void hello_world(float *p) {
15484   *p = 3.14159f;
15485 }
15487 If an assembly source file contains multiple kernels and/or functions, the
15488 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15489 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15490 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
15491 kernels, where ``func1`` is only called from ``kern1``, it is sufficient
15492 to group the function with the kernel that calls it and reset the symbols
15493 between the two connected components:
15498 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15500 // gpr tracking symbols are implicitly set to zero
15505 .type kern0,@function
15510 .size kern0, .Lkern0_end-kern0
15514 .amdhsa_kernel kern0
15516 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15517 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15520 // reset symbols to begin tracking usage in func1 and kern1
15521 .set .amdgcn.next_free_vgpr, 0
15522 .set .amdgcn.next_free_sgpr, 0
15528 .type func1,@function
15531 s_setpc_b64 s[30:31]
15533 .size func1, .Lfunc1_end-func1
15537 .type kern1,@function
15541 s_add_u32 s4, s4, func1@rel32@lo+4
15542 s_addc_u32 s5, s5, func1@rel32@lo+4
15543 s_swappc_b64 s[30:31], s[4:5]
15547 .size kern1, .Lkern1_end-kern1
15551 .amdhsa_kernel kern1
15553 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15554 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15557 These symbols cannot identify connected components in order to automatically
15558 track the usage for each kernel. However, in some cases careful organization of
15559 the kernels and functions in the source file means there is minimal additional
15560 effort required to accurately calculate GPR usage.
15562 Additional Documentation
15563 ========================
15565 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
15566 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15567 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15568 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15569 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15570 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15571 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15572 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15573 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15574 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
15575 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15576 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15577 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15578 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15579 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
15580 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
15581 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15582 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15583 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15584 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15585 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15586 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15587 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15588 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15589 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15590 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__