=============================
User Guide for AMDGPU Backend
=============================
14 AMDGPU/AMDGPUAsmGFX900
15 AMDGPU/AMDGPUAsmGFX904
16 AMDGPU/AMDGPUAsmGFX906
17 AMDGPU/AMDGPUAsmGFX908
18 AMDGPU/AMDGPUAsmGFX90a
19 AMDGPU/AMDGPUAsmGFX940
21 AMDGPU/AMDGPUAsmGFX1011
22 AMDGPU/AMDGPUAsmGFX1013
23 AMDGPU/AMDGPUAsmGFX1030
27 AMDGPUInstructionSyntax
28 AMDGPUInstructionNotation
29 AMDGPUDwarfExtensionsForHeterogeneousDebugging
30 AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
The AMDGPU backend provides ISA code generation for AMD GPUs, from the
R600 family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.
42 .. _amdgpu-target-triples:
47 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
48 to specify the target triple:
50 .. table:: AMDGPU Architectures
51 :name: amdgpu-architecture-table
53 ============ ==============================================================
54 Architecture Description
55 ============ ==============================================================
56 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
57 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
58 ============ ==============================================================
60 .. table:: AMDGPU Vendors
61 :name: amdgpu-vendor-table
63 ============ ==============================================================
65 ============ ==============================================================
66 ``amd`` Can be used for all AMD GPU usage.
67 ``mesa3d`` Can be used if the OS is ``mesa3d``.
68 ============ ==============================================================
70 .. table:: AMDGPU Operating Systems
73 ============== ============================================================
75 ============== ============================================================
76 *<empty>* Defaults to the *unknown* OS.
77 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
80 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
81 loader on Linux. See *AMD ROCm Platform Release Notes*
82 [AMD-ROCm-Release-Notes]_ for supported hardware and
84 - AMD's PAL runtime using the *pal-amdhsa* loader on
87 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
88 runtime using the *pal-amdpal* loader on Windows and Linux
90 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
91 3D runtime using the *mesa-mesa3d* loader on Linux.
92 ============== ============================================================
94 .. table:: AMDGPU Environments
95 :name: amdgpu-environment-table
97 ============ ==============================================================
98 Environment Description
99 ============ ==============================================================
101 ============ ==============================================================
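As a rough illustration of how the triple components in the tables above combine, the following sketch builds a triple string and rejects two combinations this guide rules out. The function ``make_triple`` and the constant names are hypothetical helpers, not part of LLVM or Clang, and dropping trailing empty components is an assumption of this sketch.

```python
# Values taken from the architecture, vendor, and OS tables above.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}


def make_triple(arch, vendor="amd", os="", environment=""):
    """Join <Architecture>-<Vendor>-<OS>-<Environment> into one string.

    Hypothetical helper for illustration only.
    """
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os!r}")
    # The mesa3d vendor can be used if the OS is mesa3d.
    if vendor == "mesa3d" and os != "mesa3d":
        raise ValueError("vendor mesa3d requires OS mesa3d")
    # amdhsa is not supported on the r600 architecture.
    if arch == "r600" and os == "amdhsa":
        raise ValueError("amdhsa is not supported on r600")
    parts = [arch, vendor, os, environment]
    # Assumption of this sketch: trailing empty components are dropped.
    while parts and parts[-1] == "":
        parts.pop()
    return "-".join(parts)


print(make_triple("amdgcn", "amd", "amdhsa"))
```

The resulting string is what would be passed to the ``-target`` option.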
103 .. _amdgpu-processors:
108 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
109 specify the AMDGPU processor together with optional target features. See
110 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
111 specific information.
113 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
115 * ``amdhsa`` is not supported in ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
118 .. table:: AMDGPU Processors
119 :name: amdgpu-processor-table
121 =========== =============== ============ ===== ================= =============== =============== ======================
122 Processor Alternative Target dGPU/ Target Target OS Support Example
123 Processor Triple APU Features Properties *(see* Products
124 Architecture Supported `amdgpu-os`_
133 =========== =============== ============ ===== ================= =============== =============== ======================
134 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
135 -----------------------------------------------------------------------------------------------------------------------
136 ``r600`` ``r600`` dGPU - Does not
141 ``r630`` ``r600`` dGPU - Does not
146 ``rs880`` ``r600`` dGPU - Does not
151 ``rv670`` ``r600`` dGPU - Does not
156 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
157 -----------------------------------------------------------------------------------------------------------------------
158 ``rv710`` ``r600`` dGPU - Does not
163 ``rv730`` ``r600`` dGPU - Does not
168 ``rv770`` ``r600`` dGPU - Does not
173 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
174 -----------------------------------------------------------------------------------------------------------------------
175 ``cedar`` ``r600`` dGPU - Does not
180 ``cypress`` ``r600`` dGPU - Does not
185 ``juniper`` ``r600`` dGPU - Does not
190 ``redwood`` ``r600`` dGPU - Does not
195 ``sumo`` ``r600`` dGPU - Does not
200 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
201 -----------------------------------------------------------------------------------------------------------------------
202 ``barts`` ``r600`` dGPU - Does not
207 ``caicos`` ``r600`` dGPU - Does not
212 ``cayman`` ``r600`` dGPU - Does not
217 ``turks`` ``r600`` dGPU - Does not
222 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
223 -----------------------------------------------------------------------------------------------------------------------
224 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
229 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
234 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
239 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
240 -----------------------------------------------------------------------------------------------------------------------
241 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
242 flat - *pal-amdhsa* - A6 Pro-7050B
243 scratch - *pal-amdpal* - A8-7100
251 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
252 flat - *pal-amdhsa* - FirePro W9100
253 scratch - *pal-amdpal* - FirePro S9150
255 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
256 flat - *pal-amdhsa* - Radeon R9 290x
257 scratch - *pal-amdpal* - Radeon R390
259 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
260 - ``mullins`` flat - *pal-amdpal* - E1-2200
268 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
269 flat - *pal-amdpal* - Radeon HD 8770
272 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
279 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
280 -----------------------------------------------------------------------------------------------------------------------
281 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
282 flat - *pal-amdhsa* - Pro A6-8500B
283 scratch - *pal-amdpal* - A8-8600P
299 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
300 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
301 scratch - *pal-amdpal* - Radeon R9 385
302 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
303 - *pal-amdhsa* - Radeon R9 Fury
304 - *pal-amdpal* - Radeon R9 FuryX
307 - Radeon Instinct MI8
308 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
309 flat - *pal-amdhsa* - Radeon RX 480
310 scratch - *pal-amdpal* - Radeon Instinct MI6
311 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
313 scratch - *pal-amdpal*
314 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
315 flat - *pal-amdhsa* - FirePro S7100
316 scratch - *pal-amdpal* - FirePro W7100
319 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
321 scratch - *pal-amdpal* .. TODO::
326 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
327 -----------------------------------------------------------------------------------------------------------------------
328 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
329 flat - *pal-amdhsa* Frontier Edition
330 scratch - *pal-amdpal* - Radeon RX Vega 56
334 - Radeon Instinct MI25
335 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
336 flat - *pal-amdhsa* - Ryzen 5 2400G
337 scratch - *pal-amdpal*
338 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
340 - *pal-amdpal* .. TODO::
345 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
346 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
347 scratch - *pal-amdpal* - Radeon VII
349 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
353 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
360 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
362 - xnack scratch .. TODO::
364 work-item Add product
367 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
368 flat - Ryzen 7 4700GE
369 scratch - Ryzen 5 4600G
381 ``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
383 - xnack scratch .. TODO::
385 work-item Add product
388 ``gfx941`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
390 - xnack scratch .. TODO::
392 work-item Add product
395 ``gfx942`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
397 - xnack scratch .. TODO::
399 work-item Add product
402 **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
403 -----------------------------------------------------------------------------------------------------------------------
404 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
405 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
406 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
408 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
409 - wavefrontsize64 - Absolute - *pal-amdhsa*
410 - xnack flat - *pal-amdpal*
412 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
413 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
414 - xnack scratch - *pal-amdpal*
415 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
416 - wavefrontsize64 flat - *pal-amdhsa*
417 - xnack scratch - *pal-amdpal* .. TODO::
422 **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
423 -----------------------------------------------------------------------------------------------------------------------
424 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
425 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
426 scratch - *pal-amdpal* - Radeon RX 6900 XT
427 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
428 - wavefrontsize64 flat - *pal-amdhsa*
429 scratch - *pal-amdpal*
430 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
431 - wavefrontsize64 flat - *pal-amdhsa*
432 scratch - *pal-amdpal* .. TODO::
437 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
438 - wavefrontsize64 flat
443 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
444 - wavefrontsize64 flat
450 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
451 - wavefrontsize64 flat
456 ``gfx1036`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
457 - wavefrontsize64 flat
463 **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
464 -----------------------------------------------------------------------------------------------------------------------
465 ``gfx1100`` ``amdgcn`` dGPU - cumode - Architected - *pal-amdpal* *TBA*
466 - wavefrontsize64 flat
469 work-item Add product
472 ``gfx1101`` ``amdgcn`` dGPU - cumode - Architected *TBA*
473 - wavefrontsize64 flat
476 work-item Add product
479 ``gfx1102`` ``amdgcn`` dGPU - cumode - Architected *TBA*
480 - wavefrontsize64 flat
483 work-item Add product
486 ``gfx1103`` ``amdgcn`` APU - cumode - Architected *TBA*
487 - wavefrontsize64 flat
490 work-item Add product
493 =========== =============== ============ ===== ================= =============== =============== ======================
495 .. _amdgpu-target-features:
500 Target features control how code is generated to support certain
501 processor specific features. Not all target features are supported by
502 all processors. The runtime must ensure that the features supported by
503 the device used to execute the code match the features enabled when
504 generating the code. A mismatch of features may result in incorrect
505 execution, or a reduction in performance.
The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
Target features are controlled by exactly one of the following Clang options:
513 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
The ``-mcpu`` and ``--offload-arch`` options can specify target features as
optional components of the target ID. If omitted, the target feature has the
``any`` value. See :ref:`amdgpu-target-id`.
519 ``-m[no-]<target-feature>``
521 Target features not specified by the target ID are specified using a
522 separate option. These target features can have an ``on`` or ``off``
523 value. ``on`` is specified by omitting the ``no-`` prefix, and
524 ``off`` is specified by including the ``no-`` prefix. The default
525 if not specified is ``off``.
529 ``-mcpu=gfx908:xnack+``
530 Enable the ``xnack`` feature.
531 ``-mcpu=gfx908:xnack-``
532 Disable the ``xnack`` feature.
534 Enable the ``cumode`` feature.
536 Disable the ``cumode`` feature.
538 .. table:: AMDGPU Target Features
539 :name: amdgpu-target-features-table
541 =============== ============================ ==================================================
542 Target Feature Clang Option to Control Description
544 =============== ============================ ==================================================
545 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
546 when generating code for kernels. When disabled
547 native WGP wavefront execution mode is used,
548 when enabled CU wavefront execution mode is used
549 (see :ref:`amdgpu-amdhsa-memory-model`).
551 sramecc - ``-mcpu`` If specified, generate code that can only be
552 - ``--offload-arch`` loaded and executed in a process that has a
553 matching setting for SRAMECC.
555 If not specified for code object V2 to V3, generate
556 code that can be loaded and executed in a process
557 with SRAMECC enabled.
559 If not specified for code object V4 or above, generate
560 code that can be loaded and executed in a process
561 with either setting of SRAMECC.
563 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
564 work-groups are launched in threadgroup split mode.
565 When enabled the waves of a work-group may be
566 launched in different CUs.
568 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
569 generating code for kernels. When disabled
570 native wavefront size 32 is used, when enabled
571 wavefront size 64 is used.
573 xnack - ``-mcpu`` If specified, generate code that can only be
574 - ``--offload-arch`` loaded and executed in a process that has a
575 matching setting for XNACK replay.
577 If not specified for code object V2 to V3, generate
578 code that can be loaded and executed in a process
579 with XNACK replay enabled.
581 If not specified for code object V4 or above, generate
582 code that can be loaded and executed in a process
583 with either setting of XNACK replay.
585 XNACK replay can be used for demand paging and
586 page migration. If enabled in the device, then if
587 a page fault occurs the code may execute
588 incorrectly unless generated with XNACK replay
589 enabled, or generated for code object V4 or above without
590 specifying XNACK replay. Executing code that was
591 generated with XNACK replay enabled, or generated
592 for code object V4 or above without specifying XNACK replay,
593 on a device that does not have XNACK replay
594 enabled will execute correctly but may be less
595 performant than code generated for XNACK replay
597 =============== ============================ ==================================================
599 .. _amdgpu-target-id:
604 AMDGPU supports target IDs. See `Clang Offload Bundler
605 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
606 description. The AMDGPU target specific information is:
Is an AMDGPU processor or alternative processor name specified in
:ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
the primary processor and alternative processor names. The canonical form
target ID only allows the primary processor name.
Is a target feature name specified in :ref:`amdgpu-target-features-table` that
is supported by the processor. The target features supported by each processor
are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
a target ID are marked as being controlled by ``-mcpu`` and
``--offload-arch``. Each target feature must appear at most once in a target
ID. The non-canonical form target ID allows the target features to be
specified in any order. The canonical form target ID requires the target
features to be specified in alphabetic order.
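The rules above for canonical and non-canonical target IDs can be sketched as a small normalizer. This is an illustration only: the function names are hypothetical, the ``<processor>:<feature>+`` / ``<feature>-`` spelling follows the ``-mcpu=gfx908:xnack+`` examples earlier in this guide, and the alternative-name map holds just three entries copied from :ref:`amdgpu-processor-table`. It does not check whether the processor actually supports each feature.

```python
# A small subset of alternative processor names from the processor table.
ALTERNATIVE_NAMES = {"tahiti": "gfx600", "hawaii": "gfx701", "fiji": "gfx803"}
# Features that the target features table marks as controlled by
# -mcpu / --offload-arch.
TARGET_ID_FEATURES = {"sramecc", "xnack"}


def parse_target_id(target_id):
    """Split a target ID into (processor, {feature: '+' or '-'})."""
    processor, *parts = target_id.split(":")
    features = {}
    for part in parts:
        name, setting = part[:-1], part[-1:]
        if setting not in ("+", "-") or name not in TARGET_ID_FEATURES:
            raise ValueError(f"bad target feature: {part!r}")
        if name in features:
            # Each target feature must appear at most once.
            raise ValueError(f"duplicate target feature: {name!r}")
        features[name] = setting
    return processor, features


def canonical_target_id(target_id):
    """Use the primary processor name and alphabetic feature order."""
    processor, features = parse_target_id(target_id)
    processor = ALTERNATIVE_NAMES.get(processor, processor)
    ordered = [name + features[name] for name in sorted(features)]
    return ":".join([processor] + ordered)


print(canonical_target_id("gfx908:xnack+:sramecc-"))
```

A feature that is absent from the target ID is simply not recorded here, which corresponds to the ``any`` value.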
624 .. _amdgpu-target-id-v2-v3:
626 Code Object V2 to V3 Target ID
627 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
629 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
630 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
631 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
632 directive and the bundle entry ID. In those cases it has the following BNF
637 <target-id> ::== <processor> ( "+" <target-feature> )*
639 Where a target feature is omitted if *Off* and present if *On* or *Any*.
The code object V2 to V3 cannot represent *Any* and treats it the same as *On*.
646 .. _amdgpu-embedding-bundled-objects:
648 Embedding Bundled Code Objects
649 ------------------------------
651 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
652 as described in `Clang Offload Bundler
653 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
657 The target ID syntax used for code object V2 to V3 for a bundle entry ID
658 differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
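The code object V2 to V3 form described in :ref:`amdgpu-target-id-v2-v3` (``<processor>`` followed by ``"+" <target-feature>`` for each feature that is *On* or *Any*) could be produced along these lines. The function name and the ``"on"`` / ``"off"`` / ``"any"`` setting strings are hypothetical, and emitting features in alphabetic order is an assumption of this sketch; the grammar itself does not state an order.

```python
def v2_v3_target_id(processor, supported_features, settings):
    """Build the V2 to V3 target ID string.

    supported_features: feature names the processor supports.
    settings: maps a feature name to "on", "off", or "any";
              a missing entry is treated as "any".
    Hypothetical helper for illustration only.
    """
    parts = [processor]
    for feature in sorted(supported_features):  # ordering is an assumption
        setting = settings.get(feature, "any")
        if setting not in ("on", "off", "any"):
            raise ValueError(f"bad setting for {feature!r}: {setting!r}")
        # A feature is omitted if Off and present if On or Any.
        if setting in ("on", "any"):
            parts.append("+" + feature)
    return "".join(parts)


print(v2_v3_target_id("gfx900", {"xnack"}, {}))
print(v2_v3_target_id("gfx900", {"xnack"}, {"xnack": "off"}))
```

Because *Any* and *On* produce the same string, the original distinction cannot be recovered from a V2 to V3 target ID.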
660 .. _amdgpu-address-spaces:
665 The AMDGPU architecture supports a number of memory address spaces. The address
666 space names use the OpenCL standard names, with some additions.
668 The AMDGPU address spaces correspond to target architecture specific LLVM
669 address space numbers used in LLVM IR.
671 The AMDGPU address spaces are described in
672 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
673 supported for the ``amdgcn`` target.
675 .. table:: AMDGPU Address Spaces
676 :name: amdgpu-address-spaces-table
678 ================================= =============== =========== ================ ======= ============================
679 .. 64-Bit Process Address Space
680 --------------------------------- --------------- ----------- ---------------- ------------------------------------
681 Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value
682 Space Number Name Name Size
683 ================================= =============== =========== ================ ======= ============================
684 Generic 0 flat flat 64 0x0000000000000000
685 Global 1 global global 64 0x0000000000000000
686 Region 2 N/A GDS 32 *not implemented for AMDHSA*
687 Local 3 group LDS 32 0xFFFFFFFF
688 Constant 4 constant *same as global* 64 0x0000000000000000
689 Private 5 private scratch 32 0xFFFFFFFF
690 Constant 32-bit 6 *TODO* 0x00000000
691 Buffer Fat Pointer (experimental) 7 *TODO*
692 Buffer Resource (experimental) 8 *TODO*
693 Streamout Registers 128 N/A GS_REGS
694 ================================= =============== =========== ================ ======= ============================
697 The generic address space is supported unless the *Target Properties* column
698 of :ref:`amdgpu-processor-table` specifies *Does not support generic address
The generic address space uses the hardware flat address support for two fixed
ranges of virtual addresses (the private and local apertures), which lie
outside the range of addressable global memory, to map from a flat address to
a private or local address. This uses FLAT instructions that can take a flat
address and access global, private (scratch), or group (LDS) memory depending
on whether the address is within one of the aperture ranges.
708 Flat access to scratch requires hardware aperture setup and setup in the
709 kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
710 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
711 setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
713 To convert between a private or group address space address (termed a segment
714 address) and a flat address the base address of the corresponding aperture
715 can be used. For GFX7-GFX8 these are available in the
716 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
717 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
718 GFX9-GFX11 the aperture base addresses are directly available as inline
719 constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
aligned to 2^32, which makes it easier to convert from flat to segment or
segment to flat.
724 A global address space address has the same value when used as a flat address
725 so no conversion is needed.
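To see why a 2^32-aligned aperture of 2^32 bytes simplifies the private and local conversions described above, consider the following sketch. The function names and the example base address are made up for illustration, and the sketch deliberately ignores NULL pointer handling (the NULL values differ between address spaces, as shown in :ref:`amdgpu-address-spaces-table`).

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside its aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # With an aligned base, addition and bitwise OR give the same result.
    return aperture_base | segment_address


def flat_to_segment(flat_address):
    """Recover the segment address: it is the low 32 bits."""
    return flat_address & (APERTURE_SIZE - 1)


def in_aperture(aperture_base, flat_address):
    """A flat address is in the aperture if the high 32 bits match."""
    return (flat_address >> 32) == (aperture_base >> 32)


base = 0x2000_0000_0000  # hypothetical aperture base, a multiple of 2^32
flat = segment_to_flat(base, 0x10)
print(hex(flat), hex(flat_to_segment(flat)), in_aperture(base, flat))
```

Only the high 32 bits need to be compared or replaced, so no subtraction or range check against a limit is required in this simplified model.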
727 **Global and Constant**
728 The global and constant address spaces both use global virtual addresses,
729 which are the same virtual address space used by the CPU. However, some
730 virtual addresses may only be accessible to the CPU, some only accessible
731 by the GPU, and some by both.
Using the constant address space indicates that the data will not change
during the execution of the kernel. This allows scalar read instructions to
be used. As the constant address space can only be modified on the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only the device global memory is visible
and managed on the host side. The vector and scalar L1 caches are invalidated
of volatile data before each kernel dispatch execution to allow constant
memory to change values between kernel dispatches.
743 The region address space uses the hardware Global Data Store (GDS). All
744 wavefronts executing on the same device will access the same memory for any
745 given region address. However, the same region address accessed by wavefronts
746 executing on different devices will access different memory. It is higher
747 performance than global memory. It is allocated by the runtime. The data
748 store (DS) instructions can be used to access it.
751 The local address space uses the hardware Local Data Store (LDS) which is
752 automatically allocated when the hardware creates the wavefronts of a
753 work-group, and freed when all the wavefronts of a work-group have
754 terminated. All wavefronts belonging to the same work-group will access the
755 same memory for any given local address. However, the same local address
756 accessed by wavefronts belonging to different work-groups will access
757 different memory. It is higher performance than global memory. The data store
758 (DS) instructions can be used to access it.
The private address space uses the hardware scratch memory support which
automatically allocates memory when it creates a wavefront and frees it when
a wavefront terminates. The memory accessed by a lane of a wavefront for any
given private address will be different from the memory accessed by another lane
of the same or different wavefront for the same private address.
767 If a kernel dispatch uses scratch, then the hardware allocates memory from a
768 pool of backing memory allocated by the runtime for each wavefront. The lanes
769 of the wavefront access this using dword (4 byte) interleaving. The mapping
770 used from private address to backing memory address is:
772 ``wavefront-scratch-base +
773 ((private-address / 4) * wavefront-size * 4) +
774 (wavefront-lane-id * 4) + (private-address % 4)``
776 If each lane of a wavefront accesses the same private address, the
777 interleaving results in adjacent dwords being accessed and hence requires
778 fewer cache lines to be fetched.
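The mapping above can be written out directly. The following sketch transcribes the formula; the function name, the default wavefront size of 64, and the base address of 0 used in the example are choices made here for illustration.

```python
def scratch_backing_address(wavefront_scratch_base, private_address,
                            lane_id, wavefront_size=64):
    """Map a lane's private address to a backing memory address.

    Direct transcription of the dword-interleaved mapping:
    each dword of private memory is replicated once per lane,
    and lanes are laid out next to each other.
    """
    dword_index = private_address // 4   # which dword of private memory
    byte_in_dword = private_address % 4  # byte offset within that dword
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4
            + lane_id * 4
            + byte_in_dword)


# Lanes 0..3 accessing private address 0 touch adjacent dwords.
print([scratch_backing_address(0, 0, lane) for lane in range(4)])
# The next private dword of lane 0 starts one full wavefront row later.
print(scratch_backing_address(0, 4, 0))
```

With a wavefront size of 64, one row of dwords spans 256 bytes, so consecutive private dwords of a single lane are 256 bytes apart in backing memory.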
780 There are different ways that the wavefront scratch base address is
781 determined by a wavefront (see
782 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
784 Scratch memory can be accessed in an interleaved manner using buffer
785 instructions with the scratch buffer descriptor and per wavefront scratch
786 offset, by the scratch instructions, or by flat instructions. Multi-dword
787 access is not supported except by flat and scratch instructions in
793 **Buffer Fat Pointer**
794 The buffer fat pointer is an experimental address space that is currently
795 unsupported in the backend. It exposes a non-integral pointer that is in
796 the future intended to support the modelling of 128-bit buffer descriptors
797 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
798 *pointer*), allowing normal LLVM load/store/atomic operations to be used to
799 model the buffer descriptors used heavily in graphics workloads targeting
The buffer descriptor used to construct a buffer fat pointer must be *raw*:
the stride must be 0, the "add tid" flag must be 0, the swizzle enable bits
804 must be off, and the extent must be measured in bytes. (On subtargets where
805 bounds checking may be disabled, buffer fat pointers may choose to enable
809 The buffer resource is an experimental address space that is currently unsupported
810 in the backend. It exposes a non-integral pointer that will represent a 128-bit
811 buffer descriptor resource.
813 Since, in general, a buffer resource supports complex addressing modes that cannot
814 be easily represented in LLVM (such as implicit swizzled access to structured
815 buffers), it is **illegal** to perform non-trivial address computations, such as
816 ``getelementptr`` operations, on buffer resources. They may be passed to
817 AMDGPU buffer intrinsics, and they may be converted to and from ``i128``.
819 Casting a buffer resource to a buffer fat pointer is permitted and adds an offset
822 **Streamout Registers**
823 Dedicated registers used by the GS NGG Streamout Instructions. The register
824 file is modelled as a memory in a distinct address space because it is indexed
825 by an address-like offset in place of named registers, and because register
826 accesses affect LGKMcnt. This is an internal address space used only by the
827 compiler. Do not use this address space for IR pointers.
829 .. _amdgpu-memory-scopes:
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
838 The memory model supported is based on the HSA memory model [HSA]_ which is
839 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
840 relation is transitive over the synchronizes-with relation independent of scope
841 and synchronizes-with allows the memory scope instances to be inclusive (see
842 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to match exactly. However, this
is conservatively correct for OpenCL.
848 .. table:: AMDHSA LLVM Sync Scopes
849 :name: amdgpu-amdhsa-llvm-sync-scopes-table
851 ======================= ===================================================
852 LLVM Sync Scope Description
853 ======================= ===================================================
854 *none* The default: ``system``.
856 Synchronizes with, and participates in modification
857 and seq_cst total orderings with, other operations
858 (except image operations) for all address spaces
859 (except private, or generic that accesses private)
860 provided the other operation's sync scope is:
863 - ``agent`` and executed by a thread on the same
865 - ``workgroup`` and executed by a thread in the
867 - ``wavefront`` and executed by a thread in the
870 ``agent`` Synchronizes with, and participates in modification
871 and seq_cst total orderings with, other operations
872 (except image operations) for all address spaces
873 (except private, or generic that accesses private)
874 provided the other operation's sync scope is:
876 - ``system`` or ``agent`` and executed by a thread
878 - ``workgroup`` and executed by a thread in the
880 - ``wavefront`` and executed by a thread in the
883 ``workgroup`` Synchronizes with, and participates in modification
884 and seq_cst total orderings with, other operations
885 (except image operations) for all address spaces
886 (except private, or generic that accesses private)
887 provided the other operation's sync scope is:
889 - ``system``, ``agent`` or ``workgroup`` and
890 executed by a thread in the same work-group.
891 - ``wavefront`` and executed by a thread in the
894 ``wavefront`` Synchronizes with, and participates in modification
895 and seq_cst total orderings with, other operations
896 (except image operations) for all address spaces
897 (except private, or generic that accesses private)
898 provided the other operation's sync scope is:
900 - ``system``, ``agent``, ``workgroup`` or
901 ``wavefront`` and executed by a thread in the
904 ``singlethread`` Only synchronizes with and participates in
905 modification and seq_cst total orderings with,
906 other operations (except image operations) running
907 in the same thread for all address spaces (for
908 example, in signal handlers).
910 ``one-as`` Same as ``system`` but only synchronizes with other
911 operations within the same address space.
913 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
914 operations within the same address space.
916 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
917 other operations within the same address space.
919 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
920 other operations within the same address space.
922 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
923 other operations within the same address space.
924 ======================= ===================================================
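One way to read the scope inclusion rules in the table above is as a comparison of scope ranks: two operations can synchronize when each operation's scope is wide enough to contain the other thread. The sketch below encodes that reading. All names are hypothetical, the ``closest_shared`` parameter (the narrowest scope instance that contains both threads) is a modelling device introduced here, and the exclusions for image operations and private memory are not modelled.

```python
SCOPE_RANK = {"singlethread": 0, "wavefront": 1, "workgroup": 2,
              "agent": 3, "system": 4}


def split_scope(name):
    """Return (base scope, is_one_as) for an LLVM sync scope name."""
    if name == "":
        return "system", False        # no sync scope means system
    if name == "one-as":
        return "system", True
    if name.endswith("-one-as"):
        return name[:-len("-one-as")], True
    return name, False


def may_synchronize(scope_a, scope_b, closest_shared,
                    same_address_space=True):
    """Check scope inclusion between two operations.

    closest_shared: narrowest scope instance containing both threads,
    for example "workgroup" if they share a work-group but not a
    wavefront.
    """
    base_a, one_as_a = split_scope(scope_a)
    base_b, one_as_b = split_scope(scope_b)
    needed = SCOPE_RANK[closest_shared]
    if SCOPE_RANK[base_a] < needed or SCOPE_RANK[base_b] < needed:
        return False
    # The one-as variants only synchronize within one address space.
    if (one_as_a or one_as_b) and not same_address_space:
        return False
    return True


print(may_synchronize("agent", "workgroup", "workgroup"))
print(may_synchronize("workgroup", "agent", "agent"))
```

In the first call both threads share a work-group, so ``agent`` and ``workgroup`` scopes both include the other thread. In the second call the threads only share an agent, which the ``workgroup`` scope does not cover.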
929 The AMDGPU backend implements the following LLVM IR intrinsics.
931 *This section is WIP.*
935 List AMDGPU intrinsics.
940 The AMDGPU backend supports the following LLVM IR attributes.
942 .. table:: AMDGPU LLVM IR Attributes
943 :name: amdgpu-llvm-ir-attributes-table
945 ======================================= ==========================================================
946 LLVM Attribute Description
947 ======================================= ==========================================================
948 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
949 will be specified when the kernel is dispatched. Generated
950 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
951 The implied default value is 1,1024.
953 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
954 argument block size for the implicit arguments. This
955 varies by OS and language (for OpenCL see
956 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
957 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
958 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
959 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
960 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
961 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
962 execution unit. Generated by the ``amdgpu_waves_per_eu``
963 CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
964 and the backend may not be able to satisfy the request. If
965 the specified range is incompatible with the function's
966 "amdgpu-flat-work-group-size" value, the occupancy bounds
967 implied by the workgroup size take precedence.
969 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the
970 mode register to be set on entry. Overrides the default for
971 the calling convention.
972 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of
973 the mode register to be set on entry. Overrides the default
974 for the calling convention.
976 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
977 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
978 attribute, or reached through a call site marked with this attribute,
979 the value returned by the intrinsic is undefined. The backend can
980 generally infer this during code generation, so typically there is no
981 benefit to frontends marking functions with this.
983 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
984 llvm.amdgcn.workitem.id.y intrinsic.
986 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
987 llvm.amdgcn.workitem.id.z intrinsic.
989 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
990 llvm.amdgcn.workgroup.id.x intrinsic.
992 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
993 llvm.amdgcn.workgroup.id.y intrinsic.
995 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
996 llvm.amdgcn.workgroup.id.z intrinsic.
998 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
999 llvm.amdgcn.dispatch.ptr intrinsic.
1001 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
1002 llvm.amdgcn.implicitarg.ptr intrinsic.
1004 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
1005 llvm.amdgcn.dispatch.id intrinsic.
1007 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
1008 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
1009 attributes, the queue pointer may be required in situations where the
1010 intrinsic call does not directly appear in the program. Some subtargets
1011 require the queue pointer to handle some addrspacecasts, as well
1012 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
1013 llvm.debug intrinsics.
1015 "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1016 kernel argument that holds the pointer to the hostcall buffer. If this
1017 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1019 "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1020 kernel argument that holds the pointer to an initialized memory buffer
1021 that conforms to the requirements of the malloc/free device library V1
1022 version implementation. If this attribute is absent, then the
1023 amdgpu-no-implicitarg-ptr is also removed.
1025 "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1026 kernel argument that holds the multigrid synchronization pointer. If this
1027 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1029 "amdgpu-no-default-queue" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1030 kernel argument that holds the default queue pointer. If this
1031 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1033 "amdgpu-no-completion-action" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
1034 kernel argument that holds the completion action pointer. If this
1035 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
1037 ======================================= ==========================================================
1039 .. _amdgpu-elf-code-object:
1044 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1045 can be linked by ``lld`` to produce a standard ELF shared code object which can
1046 be loaded and executed on an AMDGPU target.
1048 .. _amdgpu-elf-header:
1053 The AMDGPU backend uses the following ELF header:
1055 .. table:: AMDGPU ELF Header
1056 :name: amdgpu-elf-header-table
1058 ========================== ===============================
1060 ========================== ===============================
1061 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
1062 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
1063 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
1064 - ``ELFOSABI_AMDGPU_HSA``
1065 - ``ELFOSABI_AMDGPU_PAL``
1066 - ``ELFOSABI_AMDGPU_MESA3D``
1067 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1068 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1069 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1070 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1071 - ``ELFABIVERSION_AMDGPU_PAL``
1072 - ``ELFABIVERSION_AMDGPU_MESA3D``
1073 ``e_type`` - ``ET_REL``
1075 ``e_machine`` ``EM_AMDGPU``
1077 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1078 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1079 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1080 ========================== ===============================
1084 .. table:: AMDGPU ELF Header Enumeration Values
1085 :name: amdgpu-elf-header-enumeration-values-table
1087 =============================== =====
1089 =============================== =====
1092 ``ELFOSABI_AMDGPU_HSA`` 64
1093 ``ELFOSABI_AMDGPU_PAL`` 65
1094 ``ELFOSABI_AMDGPU_MESA3D`` 66
1095 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1096 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1097 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1098 ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1099 ``ELFABIVERSION_AMDGPU_PAL`` 0
1100 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1101 =============================== =====
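As a rough illustration of how a tool might interpret these identification bytes, the following Python sketch maps ``e_ident`` to the symbolic names used in this section. The OS ABI and ABI version values come from the enumeration table. The byte offsets, the ``ELFCLASS``/``ELFDATA`` constants, and ``ELFOSABI_NONE`` being 0 are assumptions taken from the generic ELF specification, and ``describe_ident`` is a hypothetical helper name, not part of LLVM.

```python
# Byte offsets within e_ident, assumed from the generic ELF specification.
EI_CLASS, EI_DATA, EI_OSABI, EI_ABIVERSION = 4, 5, 7, 8

ELF_CLASS = {1: "ELFCLASS32", 2: "ELFCLASS64"}
ELF_DATA = {1: "ELFDATA2LSB", 2: "ELFDATA2MSB"}

# OS ABI values from the enumeration table; ELFOSABI_NONE = 0 is an
# assumption based on the generic ELF specification.
OSABI = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}

# ABI version names are keyed by (OS ABI, version) because the PAL and
# MESA3D versions reuse the value 0 that HSA uses for V2.
ABIVERSION = {
    (64, 0): "ELFABIVERSION_AMDGPU_HSA_V2",
    (64, 1): "ELFABIVERSION_AMDGPU_HSA_V3",
    (64, 2): "ELFABIVERSION_AMDGPU_HSA_V4",
    (64, 3): "ELFABIVERSION_AMDGPU_HSA_V5",
    (65, 0): "ELFABIVERSION_AMDGPU_PAL",
    (66, 0): "ELFABIVERSION_AMDGPU_MESA3D",
}


def describe_ident(e_ident: bytes) -> dict:
    """Return symbolic names for the AMDGPU-relevant e_ident fields."""
    if len(e_ident) < 16 or e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF identification block")
    osabi = e_ident[EI_OSABI]
    return {
        "class": ELF_CLASS.get(e_ident[EI_CLASS], "unknown"),
        "data": ELF_DATA.get(e_ident[EI_DATA], "unknown"),
        "osabi": OSABI.get(osabi, "unknown"),
        "abiversion": ABIVERSION.get((osabi, e_ident[EI_ABIVERSION]), "unknown"),
    }
```

Keying the ABI version on the OS ABI avoids misreading a PAL or MESA3D code object as HSA code object V2, since all three encode the value 0.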
1103 ``e_ident[EI_CLASS]``
1106 * ``ELFCLASS32`` for ``r600`` architecture.
1108 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1109 process address space applications.
1111 ``e_ident[EI_DATA]``
1112 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1114 ``e_ident[EI_OSABI]``
1115 One of the following AMDGPU target architecture specific OS ABIs
1116 (see :ref:`amdgpu-os`):
1118 * ``ELFOSABI_NONE`` for *unknown* OS.
1120 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1122 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1124 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1126 ``e_ident[EI_ABIVERSION]``
1127 The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1130 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1131 runtime ABI for code object V2. Specify using the Clang option
1132 ``-mcode-object-version=2``.
1134 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1135 runtime ABI for code object V3. Specify using the Clang option
1136 ``-mcode-object-version=3``.
1138 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1139 runtime ABI for code object V4. Specify using the Clang option
1140 ``-mcode-object-version=4``. This is the default code object
1141 version if not specified.
1143 * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1144 runtime ABI for code object V5. Specify using the Clang option
1145 ``-mcode-object-version=5``.
1147 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1150 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1154 Can be one of the following values:
1158 The type produced by the AMDGPU backend compiler as it is relocatable code
1162 The type produced by the linker as it is a shared code object.
1164 The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1167 The value ``EM_AMDGPU`` is used for the machine for all processors supported
1168 by the ``r600`` and ``amdgcn`` architectures (see
1169 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1170 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1171 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1172 ``e_flags`` for code object V3 and above (see
1173 :ref:`amdgpu-elf-header-e_flags-table-v3` and
1174 :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1177 The entry point is 0 as the entry points for individual kernels must be
1178 selected in order to invoke them through AQL packets.
1181 The AMDGPU backend uses the following ELF header flags:
1183 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1184 :name: amdgpu-elf-header-e_flags-v2-table
1186 ===================================== ===== =============================
1187 Name Value Description
1188 ===================================== ===== =============================
1189 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
1191 enabled for all code
1192 contained in the code object.
1194 does not support the
1199 :ref:`amdgpu-target-features`.
1200 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
1201 handler is enabled for all
1202 code contained in the code
1203 object. If the processor
1204 does not support a trap
1205 handler then must be 0.
1207 :ref:`amdgpu-target-features`.
1208 ===================================== ===== =============================
1210 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1211 :name: amdgpu-elf-header-e_flags-table-v3
1213 ================================= ===== =============================
1214 Name Value Description
1215 ================================= ===== =============================
1216 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1218 ``EF_AMDGPU_MACH_xxx`` values
1220 :ref:`amdgpu-ef-amdgpu-mach-table`.
1221 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
1223 enabled for all code
1224 contained in the code object.
1226 does not support the
1231 :ref:`amdgpu-target-features`.
1232 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
1234 enabled for all code
1235 contained in the code object.
1237 does not support the
1242 :ref:`amdgpu-target-features`.
1243 ================================= ===== =============================
1245 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1246 :name: amdgpu-elf-header-e_flags-table-v4-onwards
1248 ============================================ ===== ===================================
1249 Name Value Description
1250 ============================================ ===== ===================================
1251 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1253 ``EF_AMDGPU_MACH_xxx`` values
1255 :ref:`amdgpu-ef-amdgpu-mach-table`.
1256 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1257 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1259 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1260 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1261 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1262 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1263 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1264 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1266 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1267 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
1268 ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1269 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1270 ============================================ ===== ===================================
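The difference between the V3 and V4 encodings can be sketched in Python as follows. Code object V3 uses one bit per feature, whereas V4 onwards uses a two-bit selection field per feature. The masks and values are taken from the tables above; ``decode_e_flags`` and the returned dictionary layout are hypothetical and only meant as an illustration.

```python
EF_AMDGPU_MACH = 0x0FF

# Code object V3: single-bit feature flags.
EF_AMDGPU_FEATURE_XNACK_V3 = 0x100
EF_AMDGPU_FEATURE_SRAMECC_V3 = 0x200

# Code object V4 onwards: two-bit selection masks.
EF_AMDGPU_FEATURE_XNACK_V4 = 0x300
EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xC00

XNACK_V4 = {0x000: "unsupported", 0x100: "any", 0x200: "off", 0x300: "on"}
SRAMECC_V4 = {0x000: "unsupported", 0x400: "any", 0x800: "off", 0xC00: "on"}


def decode_e_flags(e_flags: int, code_object_version: int) -> dict:
    """Split e_flags into the processor value and the feature settings."""
    if code_object_version < 3:
        # Code object V2 records the processor in a note record instead.
        raise ValueError("EF_AMDGPU_MACH is only defined for V3 and above")
    decoded = {"mach": e_flags & EF_AMDGPU_MACH}
    if code_object_version == 3:
        decoded["xnack"] = bool(e_flags & EF_AMDGPU_FEATURE_XNACK_V3)
        decoded["sramecc"] = bool(e_flags & EF_AMDGPU_FEATURE_SRAMECC_V3)
    else:
        decoded["xnack"] = XNACK_V4[e_flags & EF_AMDGPU_FEATURE_XNACK_V4]
        decoded["sramecc"] = SRAMECC_V4[e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4]
    return decoded
```

Note that the same bit pattern can mean different things depending on the code object version: ``0x100`` is "XNACK enabled" in V3 but "XNACK can have any value" in V4, which is why the version has to be known before the flags are interpreted.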
1272 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1273 :name: amdgpu-ef-amdgpu-mach-table
1275 ==================================== ========== =============================
1276 Name Value Description (see
1277 :ref:`amdgpu-processor-table`)
1278 ==================================== ========== =============================
1279 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1280 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1281 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1282 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1283 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1284 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1285 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1286 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1287 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1288 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1289 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1290 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1291 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1292 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1293 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1294 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1295 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1296 *reserved* 0x011 - Reserved for ``r600``
1297 0x01f architecture processors.
1298 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1299 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1300 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1301 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1302 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1303 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1304 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1305 *reserved* 0x027 Reserved.
1306 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1307 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1308 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1309 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1310 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1311 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1312 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1313 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1314 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1315 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1316 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1317 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1318 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1319 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1320 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1321 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1322 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1323 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1324 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1325 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1326 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1327 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1328 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1329 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1330 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940``
1331 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100``
1332 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1333 *reserved* 0x043 Reserved.
1334 ``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103``
1335 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036``
1336 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101``
1337 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102``
1338 *reserved* 0x048 Reserved.
1339 *reserved* 0x049 Reserved.
1340 *reserved* 0x04a Reserved.
1341 ``EF_AMDGPU_MACH_AMDGCN_GFX941`` 0x04b ``gfx941``
1342 ``EF_AMDGPU_MACH_AMDGCN_GFX942`` 0x04c ``gfx942``
1343 ==================================== ========== =============================
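A lookup over this table might be sketched as below. Only a handful of rows are reproduced, and ``describe_mach`` is a hypothetical helper. The sketch assumes that every value from ``0x020`` upward belongs to the ``amdgcn`` architecture, which holds for the rows listed here but is not stated as a rule by the table.

```python
EF_AMDGPU_MACH = 0x0FF

# A small subset of the EF_AMDGPU_MACH table; extend with further rows as needed.
MACH_NAMES = {
    0x001: "r600",
    0x020: "gfx600",
    0x02C: "gfx900",
    0x02F: "gfx906",
    0x030: "gfx908",
    0x036: "gfx1030",
    0x03F: "gfx90a",
    0x041: "gfx1100",
    0x04C: "gfx942",
}


def describe_mach(e_flags: int):
    """Return (architecture, processor name) for a V3+ e_flags value.

    The processor name is None for reserved values and for rows that are
    not reproduced in MACH_NAMES.
    """
    mach = e_flags & EF_AMDGPU_MACH
    if mach == 0:
        architecture = None  # EF_AMDGPU_MACH_NONE: not specified
    elif mach <= 0x01F:
        architecture = "r600"  # includes the range reserved for r600
    else:
        architecture = "amdgcn"  # assumption: all remaining values
    return architecture, MACH_NAMES.get(mach)
```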
1348 An AMDGPU target ELF code object has the standard ELF sections which include:
1350 .. table:: AMDGPU ELF Sections
1351 :name: amdgpu-elf-sections-table
1353 ================== ================ =================================
1354 Name Type Attributes
1355 ================== ================ =================================
1356 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1357 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1358 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1359 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1360 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1361 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1362 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1363 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1364 ``.note`` ``SHT_NOTE`` *none*
1365 ``.rela``\ *name* ``SHT_RELA`` *none*
1366 ``.rela.dyn`` ``SHT_RELA`` *none*
1367 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1368 ``.shstrtab`` ``SHT_STRTAB`` *none*
1369 ``.strtab`` ``SHT_STRTAB`` *none*
1370 ``.symtab`` ``SHT_SYMTAB`` *none*
1371 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1372 ================== ================ =================================
1374 These sections have their standard meanings (see [ELF]_) and are only generated
1378 The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1379 information on the DWARF produced by the AMDGPU backend.
1381 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1382 The standard sections used by a dynamic loader.
1385 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1388 ``.rela``\ *name*, ``.rela.dyn``
1389 For relocatable code objects, *name* is the name of the section to which the
1390 relocation records apply. For example, ``.rela.text`` is the section name for
1391 relocation records associated with the ``.text`` section.
1393 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1394 records from each of the relocatable code object's ``.rela``\ *name* sections.
1396 See :ref:`amdgpu-relocation-records` for the relocation records supported by
1400 The executable machine code for the kernels and functions they call. Generated
1401 as position independent code. See :ref:`amdgpu-code-conventions` for
1402 information on conventions used in the ISA generation.
1404 .. _amdgpu-note-records:
1409 The AMDGPU backend code object contains ELF note records in the ``.note``
1410 section. The set of generated notes and their semantics depend on the code
1411 object version; see :ref:`amdgpu-note-records-v2` and
1412 :ref:`amdgpu-note-records-v3-onwards`.
1414 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1415 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1416 byte aligned. In addition, minimal zero-byte padding must be generated to
1417 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1418 field of the ``.note`` section must be at least 4 to indicate at least 4 byte alignment.
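The padding rules can be illustrated with a small Python sketch that walks the contents of a ``.note`` section. It assumes the generic ELF note header of three 32-bit words (``namesz``, ``descsz``, ``type``) in little-endian byte order, with ``namesz`` counting the terminating NUL. ``iter_notes`` and ``align4`` are hypothetical helper names.

```python
import struct


def align4(size: int) -> int:
    """Round size up to the next multiple of 4 bytes."""
    return (size + 3) & ~3


def iter_notes(section: bytes):
    """Yield (name, type, desc) for each note record in a .note section."""
    offset = 0
    while offset + 12 <= len(section):
        namesz, descsz, note_type = struct.unpack_from("<III", section, offset)
        offset += 12
        # The name is NUL terminated and zero padded to a 4 byte boundary.
        name = section[offset:offset + namesz].rstrip(b"\0").decode("ascii")
        offset += align4(namesz)
        # The desc field is likewise zero padded to a multiple of 4 bytes.
        desc = section[offset:offset + descsz]
        offset += align4(descsz)
        yield name, note_type, desc
```

Because both fields are padded, a reader has to advance by the rounded-up sizes rather than by ``namesz`` and ``descsz`` themselves, otherwise the next record header is read from the wrong offset.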
1421 .. _amdgpu-note-records-v2:
1423 Code Object V2 Note Records
1424 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1427 Code object V2 is not the default code object version emitted by
1428 this version of LLVM.
1430 The AMDGPU backend code object uses the following ELF note record in the
1431 ``.note`` section when compiling for code object V2.
1433 The note record vendor field is "AMD".
1435 Additional note records may be present, but any which are not documented here
1436 are deprecated and should not be used.
1438 .. table:: AMDGPU Code Object V2 ELF Note Records
1439 :name: amdgpu-elf-note-records-v2-table
1441 ===== ===================================== ======================================
1442 Name Type Description
1443 ===== ===================================== ======================================
1444 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
1445 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
1446 Finalizer and not the LLVM compiler.
1447 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
1448 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
1449 YAML [YAML]_ textual format.
1450 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
1451 ===== ===================================== ======================================
1455 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1456 :name: amdgpu-elf-note-record-enumeration-values-v2-table
1458 ===================================== =====
1460 ===================================== =====
1461 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
1462 ``NT_AMD_HSA_HSAIL`` 2
1463 ``NT_AMD_HSA_ISA_VERSION`` 3
1465 ``NT_AMD_HSA_METADATA`` 10
1466 ``NT_AMD_HSA_ISA_NAME`` 11
1467 ===================================== =====
1469 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1470 Specifies the code object version number. The description field has the
1475 struct amdgpu_hsa_note_code_object_version_s {
1476 uint32_t major_version;
1477 uint32_t minor_version;
1480 The ``major_version`` has a value less than or equal to 2.
1482 ``NT_AMD_HSA_HSAIL``
1483 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1484 field has the following layout:
1488 struct amdgpu_hsa_note_hsail_s {
1489 uint32_t hsail_major_version;
1490 uint32_t hsail_minor_version;
1492 uint8_t machine_model;
1493 uint8_t default_float_round;
1496 ``NT_AMD_HSA_ISA_VERSION``
1497 Specifies the target ISA version. The description field has the following layout:
1501 struct amdgpu_hsa_note_isa_s {
1502 uint16_t vendor_name_size;
1503 uint16_t architecture_name_size;
1507 char vendor_and_architecture_name[1];
1510 ``vendor_name_size`` and ``architecture_name_size`` are the length of the
1511 vendor and architecture names respectively, including the NUL character.
1513 ``vendor_and_architecture_name`` contains the NUL terminated string for the
1514 vendor, immediately followed by the NUL terminated string for the
1517 This note record is used by the HSA runtime loader.
1519 Code object V2 only supports a limited number of processors and has fixed
1520 settings for target features. See
1521 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1522 processors and the corresponding target ID. In the table the note record ISA
1523 name is a concatenation of the vendor name, architecture name, major, minor,
1524 and stepping separated by a ":".
1526 The target ID column shows the processor name and fixed target features used
1527 by the LLVM compiler. The LLVM compiler does not generate an
1528 ``NT_AMD_HSA_HSAIL`` note record.
1530 A code object generated by the Finalizer also uses code object V2 and always
1531 generates an ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1532 ``sramecc`` target feature are as shown in
1533 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1534 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1537 ``NT_AMD_HSA_ISA_NAME``
1538 Specifies the target ISA name as a non-NUL terminated string.
1540 This note record is not used by the HSA runtime loader.
1542 See the ``NT_AMD_HSA_ISA_VERSION`` note record description for code object
1543 V2's limited support of processors and fixed settings for target features.
1545 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1546 from the string to the corresponding target ID. If the ``xnack`` target
1547 feature is supported and enabled, the string produced by the LLVM compiler
1548 may have ``+xnack`` appended. The Finalizer did not do the appending and
1549 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1551 ``NT_AMD_HSA_METADATA``
1552 Specifies extensible metadata associated with the code objects executed on HSA
1553 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1554 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1555 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1558 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1559 :name: amdgpu-elf-note-record-supported_processors-v2-table
1561 ===================== ==========================
1562 Note Record ISA Name Target ID
1563 ===================== ==========================
1564 ``AMD:AMDGPU:6:0:0`` ``gfx600``
1565 ``AMD:AMDGPU:6:0:1`` ``gfx601``
1566 ``AMD:AMDGPU:6:0:2`` ``gfx602``
1567 ``AMD:AMDGPU:7:0:0`` ``gfx700``
1568 ``AMD:AMDGPU:7:0:1`` ``gfx701``
1569 ``AMD:AMDGPU:7:0:2`` ``gfx702``
1570 ``AMD:AMDGPU:7:0:3`` ``gfx703``
1571 ``AMD:AMDGPU:7:0:4`` ``gfx704``
1572 ``AMD:AMDGPU:7:0:5`` ``gfx705``
1573 ``AMD:AMDGPU:8:0:0`` ``gfx802``
1574 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
1575 ``AMD:AMDGPU:8:0:2`` ``gfx802``
1576 ``AMD:AMDGPU:8:0:3`` ``gfx803``
1577 ``AMD:AMDGPU:8:0:4`` ``gfx803``
1578 ``AMD:AMDGPU:8:0:5`` ``gfx805``
1579 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
1580 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
1581 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
1582 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
1583 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
1584 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
1585 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
1586 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
1587 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
1588 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1589 ===================== ==========================
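The mapping from a note record ISA name to a target ID could be sketched in Python as follows. The ISA name is split on ":" into vendor, architecture, major, minor, and stepping, as described for ``NT_AMD_HSA_ISA_VERSION``. Only some rows of the table are reproduced, and ``parse_isa_name`` and ``target_id_for`` are hypothetical helper names.

```python
# A subset of the code object V2 table, keyed by (major, minor, stepping).
V2_TARGET_IDS = {
    (7, 0, 0): "gfx700",
    (8, 0, 1): "gfx801:xnack+",
    (8, 0, 3): "gfx803",
    (9, 0, 0): "gfx900:xnack-",
    (9, 0, 1): "gfx900:xnack+",
    (9, 0, 6): "gfx906:sramecc-:xnack-",
    (9, 0, 7): "gfx906:sramecc-:xnack+",
    (9, 0, 12): "gfx90c:xnack-",
}


def parse_isa_name(isa_name: str):
    """Split 'vendor:arch:major:minor:stepping' into its components."""
    vendor, architecture, major, minor, stepping = isa_name.split(":")
    return vendor, architecture, (int(major), int(minor), int(stepping))


def target_id_for(isa_name: str):
    """Return the target ID for a V2 ISA name, or None if it is not listed."""
    vendor, architecture, version = parse_isa_name(isa_name)
    if (vendor, architecture) != ("AMD", "AMDGPU"):
        raise ValueError("unexpected vendor or architecture: " + isa_name)
    return V2_TARGET_IDS.get(version)
```

The version components are compared as integers rather than as text, so a stepping such as ``12`` is matched as a number and not confused with a prefix like ``1``.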
1591 .. _amdgpu-note-records-v3-onwards:
1593 Code Object V3 and Above Note Records
1594 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1596 The AMDGPU backend code object uses the following ELF note record in the
1597 ``.note`` section when compiling for code object V3 and above.
1599 The note record vendor field is "AMDGPU".
1601 Additional note records may be present, but any which are not documented here
1602 are deprecated and should not be used.
1604 .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1605 :name: amdgpu-elf-note-records-table-v3-onwards
1607 ======== ============================== ======================================
1608 Name Type Description
1609 ======== ============================== ======================================
1610 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
1612 ======== ============================== ======================================
1616 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1617 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1619 ============================== =====
1621 ============================== =====
1623 ``NT_AMDGPU_METADATA`` 32
1624 ============================== =====
1626 ``NT_AMDGPU_METADATA``
1627 Specifies extensible metadata associated with an AMDGPU code object. It is
1628 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1629 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1630 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
1631 :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
1639 Symbols include the following:
1641 .. table:: AMDGPU ELF Symbols
1642 :name: amdgpu-elf-symbols-table
1644 ===================== ================== ================ ==================
1645 Name Type Section Description
1646 ===================== ================== ================ ==================
1647 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1650 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1651 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1652 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
1653 ===================== ================== ================ ==================
1656 Global variables both used and defined by the compilation unit.
1658 If the symbol is defined in the compilation unit then it is allocated in the
1659 appropriate section according to whether it has initialized data or is read-only.
1661 If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1662 will resolve relocations using the definition provided by another code object
1663 or explicitly defined by the runtime.
1665 If the symbol resides in local/group memory (LDS) then its section is the
1666 special processor specific section name ``SHN_AMDGPU_LDS``, and the
1667 ``st_value`` field describes alignment requirements as it does for common
1672 Add description of linked shared object symbols. Seems undefined symbols
1673 are marked as STT_NOTYPE.
1676 Every HSA kernel has an associated kernel descriptor. It is the address of the
1677 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1678 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1679 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1682 Every HSA kernel also has a symbol for its machine code entry point.
1684 .. _amdgpu-relocation-records:
1689 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1690 relocatable fields are:
1693 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1694 alignment. These values use the same byte order as other word values in the
1695 AMDGPU architecture.
1698 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1699 alignment. These values use the same byte order as other word values in the
1700 AMDGPU architecture.
1702 The following notations are used for specifying relocation calculations:
1705 Represents the addend used to compute the value of the relocatable field.
1708 Represents the offset into the global offset table at which the relocation
1709 entry's symbol will reside during execution.
1712 Represents the address of the global offset table.
1715 Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1716 of the storage unit being relocated (computed using ``r_offset``).
1719 Represents the value of the symbol whose index resides in the relocation
1720 entry. Relocations not using this must specify a symbol index of
1724 Represents the base address of a loaded executable or shared object which is
1725 the difference between the ELF address and the actual load address.
1726 Relocations using this are only valid in executable or shared objects.
1728 The following relocation types are supported:
1730 .. table:: AMDGPU ELF Relocation Records
1731 :name: amdgpu-elf-relocation-records-table
1733 ========================== ======= ===== ========== ==============================
1734 Relocation Type Kind Value Field Calculation
1735 ========================== ======= ===== ========== ==============================
1736 ``R_AMDGPU_NONE`` 0 *none* *none*
1737 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
1739 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
1741 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
1743 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
1744 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
1745 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
1747 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
1748 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1749 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
1750 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1751 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
1753 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
1754 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
1755 ========================== ======= ===== ========== ==============================
1757 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1758 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1760 There is no current OS loader support for 32-bit programs and so
1761 ``R_AMDGPU_ABS32`` is not used.
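The calculations in the relocation table can be written out as a short Python sketch. The formulas follow the table. Truncating each result to the width of the relocated field using two's complement wraparound is an assumption made here for illustration, as is the requirement that the ``R_AMDGPU_REL16`` distance be a multiple of 4. The function names are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def abs32_lo(S: int, A: int) -> int:
    """R_AMDGPU_ABS32_LO: low 32 bits of S + A."""
    return (S + A) & MASK32


def abs32_hi(S: int, A: int) -> int:
    """R_AMDGPU_ABS32_HI: high 32 bits of S + A."""
    return ((S + A) >> 32) & MASK32


def abs64(S: int, A: int) -> int:
    """R_AMDGPU_ABS64: S + A as a 64-bit value."""
    return (S + A) & MASK64


def rel32_lo(S: int, A: int, P: int) -> int:
    """R_AMDGPU_REL32_LO: low 32 bits of the PC-relative distance."""
    return (S + A - P) & MASK32


def rel32_hi(S: int, A: int, P: int) -> int:
    """R_AMDGPU_REL32_HI: high 32 bits of the PC-relative distance."""
    # Python's >> is an arithmetic shift, so negative distances sign-extend.
    return ((S + A - P) >> 32) & MASK32


def gotpcrel32_lo(G: int, GOT: int, A: int, P: int) -> int:
    """R_AMDGPU_GOTPCREL32_LO: low 32 bits of G + GOT + A - P."""
    return (G + GOT + A - P) & MASK32


def rel16(S: int, A: int, P: int) -> int:
    """R_AMDGPU_REL16: ((S + A - P) - 4) / 4, truncated to 16 bits."""
    distance = (S + A - P) - 4
    if distance % 4:
        raise ValueError("distance is not a multiple of 4 bytes")
    return (distance // 4) & MASK16
```

Splitting a 64-bit value into ``_LO`` and ``_HI`` halves lets two 32-bit fields be patched independently, and the two halves recombine as ``(hi << 32) | lo``.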
1763 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1765 Loaded Code Object Path Uniform Resource Identifier (URI)
1766 ---------------------------------------------------------
1768 The AMD GPU code object loader represents the path of the ELF shared object from
1769 which the code object was loaded as a textual Uniform Resource Identifier (URI).
Note that the code object is the in-memory, loaded, relocated form of the ELF
1771 shared object. Multiple code objects may be loaded at different memory
1772 addresses in the same process from the same ELF shared object.
1774 The loaded code object path URI syntax is defined by the following BNF syntax:
1778 code_object_uri ::== file_uri | memory_uri
1779 file_uri ::== "file://" file_path [ range_specifier ]
1780 memory_uri ::== "memory://" process_id range_specifier
1781 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1782 file_path ::== URI_ENCODED_OS_FILE_PATH
1783 process_id ::== DECIMAL_NUMBER
1784 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
**number**
  Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
  and octal values by "0".

**file_path**
  Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
  every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
  encoded as two uppercase hexadecimal digits preceded by "%". Directories in
  the path are separated by "/".

**offset**
  Is a 0-based byte offset to the start of the code object. For a file URI, it
  is from the start of the file specified by the ``file_path``, and if omitted
  defaults to 0. For a memory URI, it is the memory address and is required.

**size**
  Is the number of bytes in the code object. For a file URI, if omitted it
  defaults to the size of the file. It is required for a memory URI.

**process_id**
  Is the identity of the process owning the memory. For Linux it is the C
  unsigned integral decimal literal for the process ID (PID).
1813 file:///dir1/dir2/file1
1814 file:///dir3/dir4/file2#offset=0x2000&size=3000
1815 memory://1234#offset=0x20000&size=3000
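
The grammar above can be exercised with a small parser. The following is a
minimal Python sketch, not the loader's implementation: the function name
``parse_code_object_uri`` and the returned dictionary layout are hypothetical,
and it assumes the range specifier always starts with ``#`` or ``?``. A ``size``
of ``None`` stands for "the size of the file".

.. code-block:: python

  import re
  from urllib.parse import unquote

  # Range specifier at the end of the URI: "#" or "?" then offset and size.
  _RANGE = re.compile(r"[#?]offset=([0-9A-Za-z]+)&size=([0-9A-Za-z]+)$")

  def _number(text):
      """Parse a C integral literal: 0x/0X is hex, a leading 0 is octal."""
      if text[:2] in ("0x", "0X"):
          return int(text[2:], 16)
      if len(text) > 1 and text[0] == "0":
          return int(text[1:], 8)
      return int(text, 10)

  def parse_code_object_uri(uri):
      match = _RANGE.search(uri)
      offset = size = None
      if match:
          offset, size = _number(match.group(1)), _number(match.group(2))
          uri = uri[:match.start()]
      if uri.startswith("file://"):
          # The path is URI encoded, so decode the %XX escapes.
          path = unquote(uri[len("file://"):])
          return {"kind": "file", "path": path,
                  "offset": 0 if offset is None else offset, "size": size}
      if uri.startswith("memory://"):
          if match is None:
              raise ValueError("a memory URI requires offset and size")
          pid = int(uri[len("memory://"):], 10)
          return {"kind": "memory", "pid": pid, "offset": offset, "size": size}
      raise ValueError("not a code object URI: " + uri)

Because ``#`` and ``?`` are outside the unencoded character set for
``file_path``, searching for the range specifier before decoding the path is
unambiguous.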
1817 .. _amdgpu-dwarf-debug-information:
1819 DWARF Debug Information
1820 =======================
1824 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1825 is not currently fully implemented and is subject to change.
1827 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1828 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
1829 object executable code and data to the source language constructs. It can be
1830 used by tools such as debuggers and profilers. It uses features defined in
1831 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1832 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1834 This section defines the AMDGPU target architecture specific DWARF mappings.
1836 .. _amdgpu-dwarf-register-identifier:
1841 This section defines the AMDGPU target architecture register numbers used in
1842 DWARF operation expressions (see DWARF Version 5 section 2.5 and
1843 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1844 instructions (see DWARF Version 5 section 6.4 and
1845 :ref:`amdgpu-dwarf-call-frame-information`).
1847 A single code object can contain code for kernels that have different wavefront
1848 sizes. The vector registers and some scalar registers are based on the wavefront
1849 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1850 simplifies the consumer of the DWARF so that each register has a fixed size,
1851 rather than being dynamic according to the wavefront size mode. Similarly,
1852 distinct DWARF registers are defined for those registers that vary in size
1853 according to the process address size. This allows a consumer to treat a
1854 specific AMDGPU processor as a single architecture regardless of how it is
1855 configured at run time. The compiler explicitly specifies the DWARF registers
1856 that match the mode in which the code it is generating will be executed.
1858 DWARF registers are encoded as numbers, which are mapped to architecture
1859 registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.
1863 .. table:: AMDGPU DWARF Register Mapping
1864 :name: amdgpu-dwarf-register-mapping-table
1866 ============== ================= ======== ==================================
1867 DWARF Register AMDGPU Register Bit Size Description
1868 ============== ================= ======== ==================================
1869 0 PC_32 32 Program Counter (PC) when
1870 executing in a 32-bit process
1871 address space. Used in the CFI to
1872 describe the PC of the calling
1874 1 EXEC_MASK_32 32 Execution Mask Register when
1875 executing in wavefront 32 mode.
1876 2-15 *Reserved* *Reserved for highly accessed
1877 registers using DWARF shortcut.*
1878 16 PC_64 64 Program Counter (PC) when
1879 executing in a 64-bit process
1880 address space. Used in the CFI to
1881 describe the PC of the calling
1883 17 EXEC_MASK_64 64 Execution Mask Register when
1884 executing in wavefront 64 mode.
1885 18-31 *Reserved* *Reserved for highly accessed
1886 registers using DWARF shortcut.*
1887 32-95 SGPR0-SGPR63 32 Scalar General Purpose
1889 96-127 *Reserved* *Reserved for frequently accessed
1890 registers using DWARF 1-byte ULEB.*
1891 128 STATUS 32 Status Register.
1892 129-511 *Reserved* *Reserved for future Scalar
1893 Architectural Registers.*
1894 512 VCC_32 32 Vector Condition Code Register
1895 when executing in wavefront 32
1897 513-767 *Reserved* *Reserved for future Vector
1898 Architectural Registers when
1899 executing in wavefront 32 mode.*
1900 768 VCC_64 64 Vector Condition Code Register
1901 when executing in wavefront 64
1903 769-1023 *Reserved* *Reserved for future Vector
1904 Architectural Registers when
1905 executing in wavefront 64 mode.*
1906 1024-1087 *Reserved* *Reserved for padding.*
1907 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers.
1908 1130-1535 *Reserved* *Reserved for future Scalar
1909 General Purpose Registers.*
1910 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers
1911 when executing in wavefront 32
1913 1792-2047 *Reserved* *Reserved for future Vector
1914 General Purpose Registers when
1915 executing in wavefront 32 mode.*
1916 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers
1917 when executing in wavefront 32
1919 2304-2559 *Reserved* *Reserved for future Vector
1920 Accumulation Registers when
1921 executing in wavefront 32 mode.*
1922 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers
1923 when executing in wavefront 64
1925 2816-3071 *Reserved* *Reserved for future Vector
1926 General Purpose Registers when
1927 executing in wavefront 64 mode.*
1928 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers
1929 when executing in wavefront 64
1931 3328-3583 *Reserved* *Reserved for future Vector
1932 Accumulation Registers when
1933 executing in wavefront 64 mode.*
1934 ============== ================= ======== ==================================
1936 The vector registers are represented as the full size for the wavefront. They
1937 are organized as consecutive dwords (32-bits), one per lane, with the dword at
1938 the least significant bit position corresponding to lane 0 and so forth. DWARF
1939 location expressions involving the ``DW_OP_LLVM_offset`` and
1940 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1941 register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
1945 If the wavefront size is 32 lanes then the wavefront 32 mode register
1946 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1947 mode register definitions are used. Some AMDGPU targets support executing in
1948 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1949 to the wavefront mode of the generated code will be used.
1951 If code is generated to execute in a 32-bit process address space, then the
1952 32-bit process address space register definitions are used. If code is generated
1953 to execute in a 64-bit process address space, then the 64-bit process address
1954 space register definitions are used. The ``amdgcn`` target only supports the
1955 64-bit process address space.
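
As a rough illustration of how a producer or consumer might apply
:ref:`amdgpu-dwarf-register-mapping-table`, the following Python sketch maps an
architecture register to its DWARF register number. The function name
``dwarf_register`` and its parameters are hypothetical; only the numeric ranges
come from the table.

.. code-block:: python

  # Registers with a single fixed DWARF number.
  _FIXED = {
      "PC_32": 0, "EXEC_MASK_32": 1,
      "PC_64": 16, "EXEC_MASK_64": 17,
      "STATUS": 128, "VCC_32": 512, "VCC_64": 768,
  }

  # Base numbers of the vector register ranges, keyed by wavefront size.
  _VECTOR_BASE = {
      ("VGPR", 32): 1536, ("AGPR", 32): 2048,
      ("VGPR", 64): 2560, ("AGPR", 64): 3072,
  }

  def dwarf_register(name, index=0, wavefront_size=64):
      """Return the DWARF register number for an AMDGPU register."""
      if name == "SGPR":
          if 0 <= index <= 63:
              return 32 + index
          if 64 <= index <= 105:
              return 1088 + (index - 64)
          raise ValueError("SGPR index out of range")
      if name in ("VGPR", "AGPR"):
          if not 0 <= index <= 255:
              raise ValueError("vector register index out of range")
          return _VECTOR_BASE[(name, wavefront_size)] + index
      return _FIXED[name]

For example, ``dwarf_register("VGPR", 3, wavefront_size=32)`` selects the
wavefront 32 definition of ``VGPR3``, while the same call with
``wavefront_size=64`` selects the distinct wavefront 64 definition.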
1957 .. _amdgpu-dwarf-memory-space-identifier:
1959 Memory Space Identifier
1960 -----------------------
1962 The DWARF memory space represents the source language memory space. See DWARF
1963 Version 5 section 2.12 which is updated by the *DWARF Extensions For
1964 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
1966 The DWARF memory space mapping used for AMDGPU is defined in
1967 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
1969 .. table:: AMDGPU DWARF Memory Space Mapping
1970 :name: amdgpu-dwarf-memory-space-mapping-table
1972 =========================== ====== =================
1974 ---------------------------------- -----------------
1975 Memory Space Name Value Memory Space
1976 =========================== ====== =================
1977 ``DW_MSPACE_LLVM_none`` 0x0000 Generic (Flat)
1978 ``DW_MSPACE_LLVM_global`` 0x0001 Global
1979 ``DW_MSPACE_LLVM_constant`` 0x0002 Global
1980 ``DW_MSPACE_LLVM_group`` 0x0003 Local (group/LDS)
1981 ``DW_MSPACE_LLVM_private`` 0x0004 Private (Scratch)
1982 ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
1983 =========================== ====== =================
1985 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
1986 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. This
is available for use for the AMD extension for access to the hardware GDS
memory, which is scratchpad memory allocated per device.
1992 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
1993 default memory space of ``DW_MSPACE_LLVM_none`` is used.
1995 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF memory spaces to DWARF address spaces, including address size
and NULL value.
1999 .. _amdgpu-dwarf-address-space-identifier:
2001 Address Space Identifier
2002 ------------------------
2004 DWARF address spaces correspond to target architecture specific linear
2005 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
2006 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
2008 The DWARF address space mapping used for AMDGPU is defined in
2009 :ref:`amdgpu-dwarf-address-space-mapping-table`.
2011 .. table:: AMDGPU DWARF Address Space Mapping
2012 :name: amdgpu-dwarf-address-space-mapping-table
2014 ======================================= ===== ======= ======== ===================== =======================
2016 --------------------------------------- ----- ---------------- --------------------- -----------------------
2017 Address Space Name Value Address Bit Size LLVM IR Address Space
2018 --------------------------------------- ----- ------- -------- --------------------- -----------------------
2023 ======================================= ===== ======= ======== ===================== =======================
2024 ``DW_ASPACE_LLVM_none`` 0x00 64 32 Global *default address space*
2025 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
2026 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
2027 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
2029 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
2030 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
2031 ======================================= ===== ======= ======== ===================== =======================
2033 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
2034 spaces including address size and NULL value.
2036 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
2037 address space used in DWARF operations that do not specify an address space. It
2038 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
2039 related operations can refer to addresses in the program code.
2041 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
2042 specify the flat address space. If the address corresponds to an address in the
2043 local address space, then it corresponds to the wavefront that is executing the
2044 focused thread of execution. If the address corresponds to an address in the
2045 private address space, then it corresponds to the lane that is executing the
2046 focused thread of execution for languages that are implemented using a SIMD or
2047 SIMT execution model.
2051 CUDA-like languages such as HIP that do not have address spaces in the
2052 language type system, but do allow variables to be allocated in different
2053 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2054 address space in the DWARF expression operations as the default address space
2055 is the global address space.
2057 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2058 specify the local address space corresponding to the wavefront that is executing
2059 the focused thread of execution.
2061 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2062 to specify the private address space corresponding to the lane that is executing
2063 the focused thread of execution for languages that are implemented using a SIMD
2064 or SIMT execution model.
2066 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2067 to specify the unswizzled private address space corresponding to the wavefront
2068 that is executing the focused thread of execution. The wavefront view of private
2069 memory is the per wavefront unswizzled backing memory layout defined in
2070 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2071 location for the backing memory of the wavefront (namely the address is not
2072 offset by ``wavefront-scratch-base``). The following formula can be used to
2073 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2074 ``DW_ASPACE_AMDGPU_private_wave`` address:
2078 private-address-wavefront =
2079 ((private-address-lane / 4) * wavefront-size * 4) +
2080 (wavefront-lane-id * 4) + (private-address-lane % 4)
2082 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:
2088 private-address-wavefront =
2089 private-address-lane * wavefront-size
2091 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2092 complete spilled vector register back into a complete vector register in the
2093 CFI. The frame pointer can be a private lane address which is dword aligned,
2094 which can be shifted to multiply by the wavefront size, and then used to form a
2095 private wavefront address that gives a location for a contiguous set of dwords,
2096 one per lane, where the vector register dwords are spilled. The compiler knows
2097 the wavefront size since it generates the code. Note that the type of the
2098 address may have to be converted as the size of a
2099 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2100 ``DW_ASPACE_AMDGPU_private_wave`` address.
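
The conversion formula above can be checked with a short Python sketch. The
function name ``private_lane_to_wave`` is hypothetical; the arithmetic is a
direct transcription of the formula, with integer division for ``/``.

.. code-block:: python

  def private_lane_to_wave(private_address_lane, wavefront_lane_id,
                           wavefront_size):
      """Convert a private lane address to an unswizzled wavefront address."""
      dword_index = private_address_lane // 4   # which dword of the lane
      byte_in_dword = private_address_lane % 4  # byte within that dword
      return (dword_index * wavefront_size * 4
              + wavefront_lane_id * 4
              + byte_in_dword)

For a dword-aligned lane address and lane 0 the result equals
``private_address_lane * wavefront_size``, which is the simplified form: for
example, lane address 8 with a wavefront size of 64 gives 512.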
2102 .. _amdgpu-dwarf-lane-identifier:
DWARF lane identifiers specify a target architecture lane position for hardware
2108 that executes in a SIMD or SIMT manner, and on which a source language maps its
2109 threads of execution onto those lanes. The DWARF lane identifier is pushed by
2110 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2111 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2112 section :ref:`amdgpu-dwarf-operation-expressions`.
2114 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2115 wavefront. It is numbered from 0 to the wavefront size minus 1.
2117 Operation Expressions
2118 ---------------------
2120 DWARF expressions are used to compute program values and the locations of
2121 program objects. See DWARF Version 5 section 2.5 and
2122 :ref:`amdgpu-dwarf-operation-expressions`.
2124 DWARF location descriptions describe how to access storage which includes memory
2125 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2126 significant bytes first, and bits are ordered within bytes with least
2127 significant bits first.
2129 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2130 unwinding vector registers that are spilled under the execution mask to memory:
2131 the zero-single location description is the vector register, and the one-single
2132 location description is the spilled memory location description. The
2133 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2134 memory location description.
2136 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2137 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2138 controlled by the execution mask. An undefined location description together
2139 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2140 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2142 Debugger Information Entry Attributes
2143 -------------------------------------
2145 This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1
2147 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2148 :ref:`amdgpu-dwarf-low-level-information` and
2149 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2151 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2153 ``DW_AT_LLVM_lane_pc``
2154 ~~~~~~~~~~~~~~~~~~~~~~
2156 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2157 location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.
2162 If the lane is inactive, but was active on entry to the subprogram, then this is
2163 the program location in the subprogram at which execution of the lane is
conceptually positioned.
2166 If the lane was not active on entry to the subprogram, then this will be the
2167 undefined location. A client debugger can check if the lane is part of a valid
2168 work-group by checking that the lane is in the range of the associated
2169 work-group within the grid, accounting for partial work-groups. If it is not,
2170 then the debugger can omit any information for the lane. Otherwise, the debugger
2171 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2172 calling subprogram until it finds a non-undefined location. Conceptually the
2173 lane only has the call frames that it has a non-undefined
2174 ``DW_AT_LLVM_lane_pc``.
2176 The following example illustrates how the AMDGPU backend can generate a DWARF
2177 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2178 following subprogram pseudo code for a target with 64 lanes per wavefront.
2200 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2201 execution mask (``EXEC``) to linearize the control flow. The condition is
2202 evaluated to make a mask of the lanes for which the condition evaluates to true.
2203 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2204 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by negating the ``EXEC`` mask and taking the logical
``AND`` with the ``EXEC`` mask saved at the start of the region. After the ``IF/THEN/ELSE``
2207 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2208 region. This is shown below. Other approaches are possible, but the basic
2209 concept is the same.
2242 To create the DWARF location list expression that defines the location
2243 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2244 pseudo instruction can be used to annotate the linearized control flow. This can
2245 be done by defining an artificial variable for the lane PC. The DWARF location
2246 list expression created for it is used as the value of the
2247 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2249 A DWARF procedure is defined for each well nested structured control flow region
2250 which provides the conceptual lane program location for a lane if it is not
2251 active (namely it is divergent). The DWARF operation expression for each region
2252 conceptually inherits the value of the immediately enclosing region and modifies
2253 it according to the semantics of the region.
2255 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2256 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2257 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2258 region since the ``THEN`` region has completed.
2260 The lane PC artificial variable is assigned at each region transition. It uses
2261 the immediately enclosing region's DWARF procedure to compute the program
2262 location for each lane assuming they are divergent, and then modifies the result
2263 by inserting the current program location for each lane that the ``EXEC`` mask
2264 indicates is active.
2266 By having separate DWARF procedures for each region, they can be reused to
2267 define the value for any nested region. This reduces the total size of the DWARF
2268 operation expressions.
2270 The following provides an example using pseudo LLVM MIR.
2276 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2277 DW_AT_name = "__uint64";
2278 DW_AT_byte_size = 8;
2279 DW_AT_encoding = DW_ATE_unsigned;
2281 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2282 DW_AT_name = "__active_lane_pc";
2285 DW_OP_LLVM_extend 64, 64;
        DW_OP_regval_type EXEC, %__uint_64;
2287 DW_OP_LLVM_select_bit_piece 64, 64;
2290 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2291 DW_AT_name = "__divergent_lane_pc";
2293 DW_OP_LLVM_undefined;
2294 DW_OP_LLVM_extend 64, 64;
2297 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2298 DW_OP_call_ref %__divergent_lane_pc;
2299 DW_OP_call_ref %__active_lane_pc;
2303 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2308 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2309 DW_AT_name = "__divergent_lane_pc_1_then";
2310 DW_AT_location = DIExpression[
2311 DW_OP_call_ref %__divergent_lane_pc;
2312 DW_OP_addrx &lex_1_start;
2314 DW_OP_LLVM_extend 64, 64;
2315 DW_OP_call_ref %__lex_1_save_exec;
2316 DW_OP_deref_type 64, %__uint_64;
2317 DW_OP_LLVM_select_bit_piece 64, 64;
2320 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2321 DW_OP_call_ref %__divergent_lane_pc_1_then;
2322 DW_OP_call_ref %__active_lane_pc;
    DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2331 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2332 DW_AT_name = "__divergent_lane_pc_1_1_then";
2333 DW_AT_location = DIExpression[
2334 DW_OP_call_ref %__divergent_lane_pc_1_then;
2335 DW_OP_addrx &lex_1_1_start;
2337 DW_OP_LLVM_extend 64, 64;
2338 DW_OP_call_ref %__lex_1_1_save_exec;
2339 DW_OP_deref_type 64, %__uint_64;
2340 DW_OP_LLVM_select_bit_piece 64, 64;
2343 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2344 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2345 DW_OP_call_ref %__active_lane_pc;
2350 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2351 DW_AT_name = "__divergent_lane_pc_1_1_else";
2352 DW_AT_location = DIExpression[
2353 DW_OP_call_ref %__divergent_lane_pc_1_then;
2354 DW_OP_addrx &lex_1_1_end;
2356 DW_OP_LLVM_extend 64, 64;
2357 DW_OP_call_ref %__lex_1_1_save_exec;
2358 DW_OP_deref_type 64, %__uint_64;
2359 DW_OP_LLVM_select_bit_piece 64, 64;
2362 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2363 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2364 DW_OP_call_ref %__active_lane_pc;
2369 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2370 DW_OP_call_ref %__divergent_lane_pc;
2371 DW_OP_call_ref %__active_lane_pc;
2376 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2377 DW_AT_name = "__divergent_lane_pc_1_else";
2378 DW_AT_location = DIExpression[
2379 DW_OP_call_ref %__divergent_lane_pc;
2380 DW_OP_addrx &lex_1_end;
2382 DW_OP_LLVM_extend 64, 64;
2383 DW_OP_call_ref %__lex_1_save_exec;
2384 DW_OP_deref_type 64, %__uint_64;
2385 DW_OP_LLVM_select_bit_piece 64, 64;
2388 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2389 DW_OP_call_ref %__divergent_lane_pc_1_else;
2390 DW_OP_call_ref %__active_lane_pc;
    DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2396 DW_OP_call_ref %__divergent_lane_pc;
2397 DW_OP_call_ref %__active_lane_pc;
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are created for
2406 the execution masks saved on entry to a region. Using the ``DBG_VALUE`` pseudo
2407 instruction, location list entries will be created that describe where the
2408 artificial variables are allocated at any given program location. The compiler
2409 may allocate them to registers or spill them to memory.
2411 The DWARF procedures for each region use the values of the saved execution mask
2412 artificial variables to only update the lanes that are active on entry to the
2413 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will have
2415 the undefined location description.
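
The way the region procedures compose can be modelled outside of DWARF. The
following Python sketch is only a conceptual model under stated assumptions: it
represents the lane PC vector as a list with one entry per lane, uses ``None``
for the undefined location, and models ``DW_OP_LLVM_select_bit_piece`` as a
per-lane choice driven by a bit mask. All function names are hypothetical.

.. code-block:: python

  UNDEFINED = None  # stands in for the undefined location description

  def select_by_mask(zero_case, one_case, mask):
      """Per lane, take one_case where the mask bit is set, else zero_case."""
      return [one_case[lane] if (mask >> lane) & 1 else zero_case[lane]
              for lane in range(len(zero_case))]

  def divergent_lane_pc(lanes=64):
      """Outermost procedure: no lane has a defined program location yet."""
      return [UNDEFINED] * lanes

  def region_lane_pc(enclosing, divergent_pc, save_exec):
      """Region procedure: lanes active on region entry get divergent_pc."""
      return select_by_mask(enclosing, [divergent_pc] * len(enclosing),
                            save_exec)

  def lane_pc(divergent, current_pc, exec_mask):
      """Active lanes report the current PC; others keep the divergent PC."""
      return select_by_mask(divergent, [current_pc] * len(divergent),
                            exec_mask)

With four lanes, a saved mask of ``0b0111`` on entry to a region and an
``EXEC`` mask of ``0b0011`` inside its ``THEN`` part, lanes 0 and 1 report the
current PC, lane 2 reports the region's divergent PC, and lane 3 stays
undefined because it was not active on entry to the subprogram.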
2417 Other structured control flow regions can be handled similarly. For example,
2418 loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.
2422 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2423 ``IF/THEN/ELSE`` regions.
2425 The DWARF procedures can use the active lane artificial variable described in
2426 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2427 ``EXEC`` mask in order to support whole or quad wavefront mode.
2429 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2431 ``DW_AT_LLVM_active_lane``
2432 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2434 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.
2438 The execution mask may be modified to implement whole or quad wavefront mode
2439 operations. For example, all lanes may need to temporarily be made active to
2440 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2441 update it to enable the necessary lanes, perform the operations, and then
2442 restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
current value.
2446 This is handled by defining an artificial variable for the active lane mask. The
2447 active lane mask artificial variable would be the actual ``EXEC`` mask for
2448 normal regions, and the saved execution mask for regions where the mask is
2449 temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.
2453 ``DW_AT_LLVM_augmentation``
2454 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2456 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2457 debugger information entry has the following value for the augmentation string:
2463 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2464 extensions used in the DWARF of the compilation unit. The version number
2465 conforms to [SEMVER]_.
2467 Call Frame Information
2468 ----------------------
2470 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2471 *unwind* call frames in a running process or core dump. See DWARF Version 5
2472 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2474 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2476 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE or in the FDEs that use it. The version number
2484 conforms to [SEMVER]_.
2486 2. ``address_size`` for the ``Global`` address space is defined in
2487 :ref:`amdgpu-dwarf-address-space-identifier`.
2489 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2491 4. ``code_alignment_factor`` is 4 bytes.
2495 Add to :ref:`amdgpu-processor-table` table.
2497 5. ``data_alignment_factor`` is 4 bytes.
2501 Add to :ref:`amdgpu-processor-table` table.
2503 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2504 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2506 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
2507 called from subprogram Y that has more allocated, X will not change any of
2508 the extra registers as it cannot access them. Therefore, the default rule
2509 for all columns is ``same value``.
2511 For AMDGPU the register number follows the numbering defined in
2512 :ref:`amdgpu-dwarf-register-identifier`.
2514 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2515 the return address to get the address of a byte within the call site
2516 instructions. See DWARF Version 5 section 6.4.4.
2521 See DWARF Version 5 section 6.1.
2523 Lookup By Name Section Header
2524 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2526 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2528 For AMDGPU the lookup by name section header table:
2530 ``augmentation_string_size`` (uword)
  Set to the length of the ``augmentation_string`` value which is always a
  multiple of 4.
2535 ``augmentation_string`` (sequence of UTF-8 characters)
2537 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2543 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.
2549 This is different to the DWARF Version 5 definition that requires the first
2550 4 characters to be the vendor ID. But this is consistent with the other
2551 augmentation strings and does allow multiple vendor contributions. However,
2552 backwards compatibility may be more desirable.
2554 Lookup By Address Section Header
2555 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2557 See DWARF Version 5 section 6.1.2.
2559 For AMDGPU the lookup by address section header table:
2561 ``address_size`` (ubyte)
2563 Match the address size for the ``Global`` address space defined in
2564 :ref:`amdgpu-dwarf-address-space-identifier`.
2566 ``segment_selector_size`` (ubyte)
2568 AMDGPU does not use a segment selector so this is 0. The entries in the
2569 ``.debug_aranges`` do not have a segment selector.
2571 Line Number Information
2572 -----------------------
2574 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2577 The instruction set must be obtained from the ELF file header ``e_flags`` field
2578 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2579 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
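
Since the instruction set is not recorded in the line table, a consumer has to
read it from the ELF header. The following Python sketch shows one way to do
that. It assumes a little-endian ELF64 header, in which ``e_flags`` is the
32-bit field at byte offset 48, and it assumes the ``EF_AMDGPU_MACH`` mask is
``0x0ff``; both should be checked against :ref:`amdgpu-elf-header`. The
function name ``amdgpu_mach`` is hypothetical.

.. code-block:: python

  import struct

  EF_AMDGPU_MACH = 0x0FF  # assumed mask of the machine field in e_flags
  E_FLAGS_OFFSET = 48     # offset of e_flags in an ELF64 header

  def amdgpu_mach(elf_header):
      """Extract the EF_AMDGPU_MACH value from the first bytes of an ELF file."""
      if elf_header[:4] != b"\x7fELF":
          raise ValueError("not an ELF file")
      (e_flags,) = struct.unpack_from("<I", elf_header, E_FLAGS_OFFSET)
      return e_flags & EF_AMDGPU_MACH

The returned number still has to be looked up in the processor table to obtain
a processor name.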
2583 Should the ``isa`` state machine register be used to indicate if the code is
2584 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2586 For AMDGPU the line number program header fields have the following values (see
2587 DWARF Version 5 section 6.2.4):
2589 ``address_size`` (ubyte)
2590 Matches the address size for the ``Global`` address space defined in
2591 :ref:`amdgpu-dwarf-address-space-identifier`.
2593 ``segment_selector_size`` (ubyte)
2594 AMDGPU does not use a segment selector so this is 0.
2596 ``minimum_instruction_length`` (ubyte)
2597 For GFX9-GFX11 this is 4.
2599 ``maximum_operations_per_instruction`` (ubyte)
2600 For GFX9-GFX11 this is 1.
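With these header values, the address advance rule of DWARF Version 5 section 6.2.5.1 reduces to a multiplication by 4. A minimal sketch, in which the function and constant names are illustrative only:

```python
MINIMUM_INSTRUCTION_LENGTH = 4          # GFX9-GFX11 value from the text
MAXIMUM_OPERATIONS_PER_INSTRUCTION = 1  # GFX9-GFX11 value from the text


def advance(address: int, op_index: int, operation_advance: int):
    """Apply the DWARF Version 5 address advance rule to the state machine."""
    total = op_index + operation_advance
    address += MINIMUM_INSTRUCTION_LENGTH * (
        total // MAXIMUM_OPERATIONS_PER_INSTRUCTION
    )
    op_index = total % MAXIMUM_OPERATIONS_PER_INSTRUCTION
    return address, op_index
```

With one operation per instruction, ``op_index`` stays 0 and every operation advance moves the address by a multiple of 4 bytes.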
2602 Source text for online-compiled programs (for example, those compiled by the
2603 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2604 See DWARF Version 5 section 6.2.4.1, which is updated by *DWARF Extensions For
2605 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2606 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2608 The Clang option used to control source embedding in AMDGPU is defined in
2609 :ref:`amdgpu-clang-debug-options-table`.
.. table:: AMDGPU Clang Debug Options
   :name: amdgpu-clang-debug-options-table

   ==================== ==================================================
   Debug Flag           Description
   ==================== ==================================================
   -g[no-]embed-source  Enable/disable embedding source text in DWARF
                        debug sections. Useful for environments where
                        source cannot be written to disk, such as
                        when performing online compilation.
   ==================== ==================================================
2626 Enable the embedded source.
2628 ``-gno-embed-source``
2629 Disable the embedded source.
2631 32-Bit and 64-Bit DWARF Formats
2632 -------------------------------
2634 See DWARF Version 5 section 7.4 and
2635 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
2639 * For the ``amdgcn`` target architecture only the 64-bit process address space
2642 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2643 the 32-bit DWARF format.
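A consumer can tell the two formats apart from the initial length field of a unit, as defined in DWARF Version 5 section 7.4. The following is a sketch under the assumption of little-endian data; the function name is hypothetical.

```python
import struct


def read_initial_length(data: bytes, offset: int = 0):
    """Return (format_bits, unit_length) for the unit starting at offset."""
    (first,) = struct.unpack_from("<I", data, offset)
    if first == 0xFFFFFFFF:
        # 64-bit DWARF: escape value followed by a 64-bit length.
        (length,) = struct.unpack_from("<Q", data, offset + 4)
        return 64, length
    if first >= 0xFFFFFFF0:
        raise ValueError("reserved initial length value")
    return 32, first  # 32-bit DWARF, the format LLVM generates
```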
2648 For AMDGPU the following values apply for each of the unit headers described in
2649 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2651 ``address_size`` (ubyte)
2652 Matches the address size for the ``Global`` address space defined in
2653 :ref:`amdgpu-dwarf-address-space-identifier`.
2655 .. _amdgpu-code-conventions:
2660 This section provides code conventions used for each supported target triple OS
2661 (see :ref:`amdgpu-target-triples`).
2666 This section provides code conventions used when the target triple OS is
2667 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2669 .. _amdgpu-amdhsa-code-object-metadata:
2671 Code Object Metadata
2672 ~~~~~~~~~~~~~~~~~~~~
2674 The code object metadata specifies extensible metadata associated with the code
2675 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
2676 encoding and semantics of this metadata depend on the code object version; see
2677 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2678 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2679 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2680 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2682 Code object metadata is specified in a note record (see
2683 :ref:`amdgpu-note-records`) and is required when the target triple OS is
2684 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2685 information necessary to support the HSA compatible runtime kernel queries. For
2686 example, the segment sizes needed in a dispatch packet. In addition, a
2687 high-level language runtime may require other information to be included. For
2688 example, the AMD OpenCL runtime records kernel argument information.
2690 .. _amdgpu-amdhsa-code-object-metadata-v2:
2692 Code Object V2 Metadata
2693 +++++++++++++++++++++++
2696 Code object V2 is not the default code object version emitted by this version
2699 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2700 (see :ref:`amdgpu-note-records-v2`).
2702 The metadata is specified as a YAML formatted string (see [YAML]_ and
2707 Is the string null terminated? It probably should not if YAML allows it to
2708 contain null characters, otherwise it should be.
2710 The metadata is represented as a single YAML document composed of the mapping
2711 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2714 For boolean values, the string values of ``false`` and ``true`` are used for
2715 false and true respectively.
2717 Additional information can be added to the mappings. To avoid conflicts, any
2718 non-AMD key names should be prefixed by "*vendor-name*.".
2720 .. table:: AMDHSA Code Object V2 Metadata Map
2721 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2723 ========== ============== ========= =======================================
2724 String Key Value Type Required? Description
2725 ========== ============== ========= =======================================
2726 "Version" sequence of Required - The first integer is the major
2727 2 integers version. Currently 1.
2728 - The second integer is the minor
2729 version. Currently 0.
2730 "Printf" sequence of Each string is encoded information
2731 strings about a printf function call. The
2732 encoded information is organized as
2733 fields separated by colon (':'):
2735 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2740 A 32-bit integer as a unique id for
2741 each printf function call
2744 A 32-bit integer equal to the number
2745 of arguments of printf function call
2748 ``S[i]`` (where i = 0, 1, ... , N-1)
2749 32-bit integers for the size in bytes
2750 of the i-th FormatString argument of
2751 the printf function call
2754 The format string passed to the
2755 printf function call.
2756 "Kernels" sequence of Required Sequence of the mappings for each
2757 mapping kernel in the code object. See
2758 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2759 for the definition of the mapping.
2760 ========== ============== ========= =======================================
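The "Printf" entry layout ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` can be decoded with a bounded split, so that colons inside the format string are preserved. This is an illustrative sketch only; the function name and the sample strings are hypothetical and not taken from any runtime.

```python
def parse_printf_entry(entry: str):
    """Split one encoded printf string into (id, argument sizes, format)."""
    ident, count, rest = entry.split(":", 2)
    n = int(count)
    # Split off exactly N size fields; the remainder is the format string.
    parts = rest.split(":", n)
    if len(parts) != n + 1:
        raise ValueError("entry has fewer size fields than N")
    sizes = [int(p) for p in parts[:n]]
    return int(ident), sizes, parts[n]


# Hypothetical entry: id 1, two arguments of 4 and 8 bytes.
print(parse_printf_entry("1:2:4:8:x=%d y=%ld"))
```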
2764 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2765 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2767 ================= ============== ========= ================================
2768 String Key Value Type Required? Description
2769 ================= ============== ========= ================================
2770 "Name" string Required Source name of the kernel.
2771 "SymbolName" string Required Name of the kernel
2772 descriptor ELF symbol.
2773 "Language" string Source language of the kernel.
2781 "LanguageVersion" sequence of - The first integer is the major
2783 - The second integer is the
2785 "Attrs" mapping Mapping of kernel attributes.
2787 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2788 for the mapping definition.
2789 "Args" sequence of Sequence of mappings of the
2790 mapping kernel arguments. See
2791 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2792 for the definition of the mapping.
2793 "CodeProps" mapping Mapping of properties related to
2794 the kernel code. See
2795 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2796 for the mapping definition.
2797 ================= ============== ========= ================================
2801 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2802 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2804 =================== ============== ========= ==============================
2805 String Key Value Type Required? Description
2806 =================== ============== ========= ==============================
2807 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
2808 3 integers must be >=1 and the dispatch
2809 work-group size X, Y, Z must
2810 correspond to the specified
2811 values. Defaults to 0, 0, 0.
2813 Corresponds to the OpenCL
2814 ``reqd_work_group_size``
2816 "WorkGroupSizeHint" sequence of The dispatch work-group size
2817 3 integers X, Y, Z is likely to be the
2820 Corresponds to the OpenCL
2821 ``work_group_size_hint``
2823 "VecTypeHint" string The name of a scalar or vector
2826 Corresponds to the OpenCL
2827 ``vec_type_hint`` attribute.
2829 "RuntimeHandle" string The external symbol name
2830 associated with a kernel.
2831 OpenCL runtime allocates a
2832 global buffer for the symbol
2833 and saves the kernel's address
2834 to it, which is used for
2835 device side enqueueing. Only
2836 available for device side
2838 =================== ============== ========= ==============================
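The "ReqdWorkGroupSize" rule can be expressed as a small check. This sketch reads "must correspond to the specified values" as exact equality, which is an interpretation; the function name is hypothetical.

```python
def dispatch_size_allowed(reqd, dispatch) -> bool:
    """Check a dispatch work-group size against "ReqdWorkGroupSize"."""
    reqd = tuple(reqd)
    if reqd == (0, 0, 0):  # default value: no required size
        return True
    if any(value < 1 for value in reqd):
        raise ValueError("all values must be >= 1 unless they are 0, 0, 0")
    return tuple(dispatch) == reqd
```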
2842 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2843 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2845 ================= ============== ========= ================================
2846 String Key Value Type Required? Description
2847 ================= ============== ========= ================================
2848 "Name" string Kernel argument name.
2849 "TypeName" string Kernel argument type name.
2850 "Size" integer Required Kernel argument size in bytes.
2851 "Align" integer Required Kernel argument alignment in
2852 bytes. Must be a power of two.
2853 "ValueKind" string Required Kernel argument kind that
2854 specifies how to set up the
2855 corresponding argument.
2859 The argument is copied
2860 directly into the kernarg.
2863 A global address space pointer
2864 to the buffer data is passed
2867 "DynamicSharedPointer"
2868 A group address space pointer
2869 to dynamically allocated LDS
2870 is passed in the kernarg.
2873 A global address space
2874 pointer to a S# is passed in
2878 A global address space
2879 pointer to a T# is passed in
2883 A global address space pointer
2884 to an OpenCL pipe is passed in
2888 A global address space pointer
2889 to an OpenCL device enqueue
2890 queue is passed in the
2893 "HiddenGlobalOffsetX"
2894 The OpenCL grid dispatch
2895 global offset for the X
2896 dimension is passed in the
2899 "HiddenGlobalOffsetY"
2900 The OpenCL grid dispatch
2901 global offset for the Y
2902 dimension is passed in the
2905 "HiddenGlobalOffsetZ"
2906 The OpenCL grid dispatch
2907 global offset for the Z
2908 dimension is passed in the
2912 An argument that is not used
2913 by the kernel. Space needs to
2914 be left for it, but it does
2915 not need to be set up.
2917 "HiddenPrintfBuffer"
2918 A global address space pointer
2919 to the runtime printf buffer
2920 is passed in kernarg. Mutually
2922 "HiddenHostcallBuffer".
2924 "HiddenHostcallBuffer"
2925 A global address space pointer
2926 to the runtime hostcall buffer
2927 is passed in kernarg. Mutually
2929 "HiddenPrintfBuffer".
2931 "HiddenDefaultQueue"
2932 A global address space pointer
2933 to the OpenCL device enqueue
2934 queue that should be used by
2935 the kernel by default is
2936 passed in the kernarg.
2938 "HiddenCompletionAction"
2939 A global address space pointer
2940 to help link enqueued kernels into
2941 the ancestor tree for determining
2942 when the parent kernel has finished.
2944 "HiddenMultiGridSyncArg"
2945 A global address space pointer for
2946 multi-grid synchronization is
2947 passed in the kernarg.
2949 "ValueType" string Unused and deprecated. This should no longer
2950 be emitted, but is accepted for compatibility.
2953 "PointeeAlign" integer Alignment in bytes of pointee
2954 type for pointer type kernel
2955 argument. Must be a power
2956 of 2. Only present if
2958 "DynamicSharedPointer".
2959 "AddrSpaceQual" string Kernel argument address space
2960 qualifier. Only present if
2961 "ValueKind" is "GlobalBuffer" or
2962 "DynamicSharedPointer". Values
2974 Is GlobalBuffer only Global
2976 DynamicSharedPointer always
2977 Local? Can HCC allow Generic?
2978 How can Private or Region
2981 "AccQual" string Kernel argument access
2982 qualifier. Only present if
2983 "ValueKind" is "Image" or
2996 "ActualAccQual" string The actual memory accesses
2997 performed by the kernel on the
2998 kernel argument. Only present if
2999 "ValueKind" is "GlobalBuffer",
3000 "Image", or "Pipe". This may be
3001 more restrictive than indicated
3002 by "AccQual" to reflect what the
3003 kernel actually does. If not
3004 present then the runtime must
3005 assume what is implied by
3006 "AccQual" and "IsConst". Values
3013 "IsConst" boolean Indicates if the kernel argument
3014 is const qualified. Only present
3018 "IsRestrict" boolean Indicates if the kernel argument
3019 is restrict qualified. Only
3020 present if "ValueKind" is
3023 "IsVolatile" boolean Indicates if the kernel argument
3024 is volatile qualified. Only
3025 present if "ValueKind" is
3028 "IsPipe" boolean Indicates if the kernel argument
3029 is pipe qualified. Only present
3030 if "ValueKind" is "Pipe".
3034 Can GlobalBuffer be pipe
3037 ================= ============== ========= ================================
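The required keys and the power-of-two rule for "Align" in the table above can be checked as sketched below. The helper names are hypothetical, and the argument is assumed to be a dictionary obtained by loading the YAML document.

```python
def is_power_of_two(value: int) -> bool:
    """True for 1, 2, 4, 8, ... and False otherwise."""
    return value > 0 and (value & (value - 1)) == 0


def check_argument(arg: dict) -> None:
    """Validate one V2 kernel argument mapping."""
    for key in ("Size", "Align", "ValueKind"):  # keys marked Required
        if key not in arg:
            raise ValueError(f"missing required key {key!r}")
    if not is_power_of_two(arg["Align"]):
        raise ValueError('"Align" must be a power of two')
```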
3041 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
3042 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
3044 ============================ ============== ========= =====================
3045 String Key Value Type Required? Description
3046 ============================ ============== ========= =====================
3047 "KernargSegmentSize" integer Required The size in bytes of
3049 that holds the values
3052 "GroupSegmentFixedSize" integer Required The amount of group
3056 bytes. This does not
3058 dynamically allocated
3059 group segment memory
3063 "PrivateSegmentFixedSize" integer Required The amount of fixed
3064 private address space
3065 memory required for a
3067 bytes. If the kernel
3069 stack then additional
3071 to this value for the
3073 "KernargSegmentAlign" integer Required The maximum byte
3076 kernarg segment. Must
3078 "WavefrontSize" integer Required Wavefront size. Must
3080 "NumSGPRs" integer Required Number of scalar
3084 includes the special
3086 Scratch (GFX7-GFX10)
3088 GFX8-GFX10). It does
3090 SGPR added if a trap
3096 "NumVGPRs" integer Required Number of vector
3100 "MaxFlatWorkGroupSize" integer Required Maximum flat
3103 kernel in work-items.
3106 ReqdWorkGroupSize if
3108 "NumSpilledSGPRs" integer Number of stores from
3109 a scalar register to
3110 a register allocator
3113 "NumSpilledVGPRs" integer Number of stores from
3114 a vector register to
3115 a register allocator
3118 ============================ ============== ========= =====================
3120 .. _amdgpu-amdhsa-code-object-metadata-v3:
3122 Code Object V3 Metadata
3123 +++++++++++++++++++++++
3126 Code object V3 is not the default code object version emitted by this version
3129 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3130 record (see :ref:`amdgpu-note-records-v3-onwards`).
3132 The metadata is represented as Message Pack formatted binary data (see
3133 [MsgPack]_). The top level is a Message Pack map that includes the
3134 keys defined in table
3135 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3138 Additional information can be added to the maps. To avoid conflicts,
3139 any key names should be prefixed by "*vendor-name*." where
3140 ``vendor-name`` can be the name of the vendor and specific vendor
3141 tool that generates the information. The prefix is abbreviated to
3142 simply "." when it appears within a map that has been added by the
3145 .. table:: AMDHSA Code Object V3 Metadata Map
3146 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3148 ================= ============== ========= =======================================
3149 String Key Value Type Required? Description
3150 ================= ============== ========= =======================================
3151 "amdhsa.version" sequence of Required - The first integer is the major
3152 2 integers version. Currently 1.
3153 - The second integer is the minor
3154 version. Currently 0.
3155 "amdhsa.printf" sequence of Each string is encoded information
3156 strings about a printf function call. The
3157 encoded information is organized as
3158 fields separated by colon (':'):
3160 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3165 A 32-bit integer as a unique id for
3166 each printf function call
3169 A 32-bit integer equal to the number
3170 of arguments of printf function call
3173 ``S[i]`` (where i = 0, 1, ... , N-1)
3174 32-bit integers for the size in bytes
3175 of the i-th FormatString argument of
3176 the printf function call
3179 The format string passed to the
3180 printf function call.
3181 "amdhsa.kernels" sequence of Required Sequence of the maps for each
3182 map kernel in the code object. See
3183 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3184 for the definition of the keys included
3186 ================= ============== ========= =======================================
3190 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3191 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3193 =================================== ============== ========= ================================
3194 String Key Value Type Required? Description
3195 =================================== ============== ========= ================================
3196 ".name" string Required Source name of the kernel.
3197 ".symbol" string Required Name of the kernel
3198 descriptor ELF symbol.
3199 ".language" string Source language of the kernel.
3209 ".language_version" sequence of - The first integer is the major
3211 - The second integer is the
3213 ".args" sequence of Sequence of maps of the
3214 map kernel arguments. See
3215 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3216 for the definition of the keys
3217 included in that map.
3218 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3219 3 integers must be >=1 and the dispatch
3220 work-group size X, Y, Z must
3221 correspond to the specified
3222 values. Defaults to 0, 0, 0.
3224 Corresponds to the OpenCL
3225 ``reqd_work_group_size``
3227 ".workgroup_size_hint" sequence of The dispatch work-group size
3228 3 integers X, Y, Z is likely to be the
3231 Corresponds to the OpenCL
3232 ``work_group_size_hint``
3234 ".vec_type_hint" string The name of a scalar or vector
3237 Corresponds to the OpenCL
3238 ``vec_type_hint`` attribute.
3240 ".device_enqueue_symbol" string The external symbol name
3241 associated with a kernel.
3242 OpenCL runtime allocates a
3243 global buffer for the symbol
3244 and saves the kernel's address
3245 to it, which is used for
3246 device side enqueueing. Only
3247 available for device side
3249 ".kernarg_segment_size" integer Required The size in bytes of
3251 that holds the values
3254 ".group_segment_fixed_size" integer Required The amount of group
3258 bytes. This does not
3260 dynamically allocated
3261 group segment memory
3265 ".private_segment_fixed_size" integer Required The amount of fixed
3266 private address space
3267 memory required for a
3269 bytes. If the kernel
3271 stack then additional
3273 to this value for the
3275 ".kernarg_segment_align" integer Required The maximum byte
3278 kernarg segment. Must
3280 ".wavefront_size" integer Required Wavefront size. Must
3282 ".sgpr_count" integer Required Number of scalar
3283 registers required by a
3285 GFX6-GFX9. A register
3286 is required if it is
3288 if a higher numbered
3291 includes the special
3297 SGPR added if a trap
3303 ".vgpr_count" integer Required Number of vector
3304 registers required by
3306 GFX6-GFX9. A register
3307 is required if it is
3309 if a higher numbered
3312 ".agpr_count" integer Required Number of accumulator
3313 registers required by
3316 ".max_flat_workgroup_size" integer Required Maximum flat
3319 kernel in work-items.
3322 ".reqd_workgroup_size" if
3324 ".sgpr_spill_count" integer Number of stores from
3325 a scalar register to
3326 a register allocator
3329 ".vgpr_spill_count" integer Number of stores from
3330 a vector register to
3331 a register allocator
3334 ".kind" string The kind of the kernel
3342 These kernels must be
3343 invoked after loading
3353 These kernels must be
3356 containing code object
3357 and after all init and
3358 normal kernels in the
3359 same code object have
3363 If omitted, "normal" is
3365 =================================== ============== ========= ================================
3369 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3370 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3372 ====================== ============== ========= ================================
3373 String Key Value Type Required? Description
3374 ====================== ============== ========= ================================
3375 ".name" string Kernel argument name.
3376 ".type_name" string Kernel argument type name.
3377 ".size" integer Required Kernel argument size in bytes.
3378 ".offset" integer Required Kernel argument offset in
3379 bytes. The offset must be a
3380 multiple of the alignment
3381 required by the argument.
3382 ".value_kind" string Required Kernel argument kind that
3383 specifies how to set up the
3384 corresponding argument.
3388 The argument is copied
3389 directly into the kernarg.
3392 A global address space pointer
3393 to the buffer data is passed
3396 "dynamic_shared_pointer"
3397 A group address space pointer
3398 to dynamically allocated LDS
3399 is passed in the kernarg.
3402 A global address space
3403 pointer to a S# is passed in
3407 A global address space
3408 pointer to a T# is passed in
3412 A global address space pointer
3413 to an OpenCL pipe is passed in
3417 A global address space pointer
3418 to an OpenCL device enqueue
3419 queue is passed in the
3422 "hidden_global_offset_x"
3423 The OpenCL grid dispatch
3424 global offset for the X
3425 dimension is passed in the
3428 "hidden_global_offset_y"
3429 The OpenCL grid dispatch
3430 global offset for the Y
3431 dimension is passed in the
3434 "hidden_global_offset_z"
3435 The OpenCL grid dispatch
3436 global offset for the Z
3437 dimension is passed in the
3441 An argument that is not used
3442 by the kernel. Space needs to
3443 be left for it, but it does
3444 not need to be set up.
3446 "hidden_printf_buffer"
3447 A global address space pointer
3448 to the runtime printf buffer
3449 is passed in kernarg. Mutually
3451 "hidden_hostcall_buffer"
3452 before Code Object V5.
3454 "hidden_hostcall_buffer"
3455 A global address space pointer
3456 to the runtime hostcall buffer
3457 is passed in kernarg. Mutually
3459 "hidden_printf_buffer"
3460 before Code Object V5.
3462 "hidden_default_queue"
3463 A global address space pointer
3464 to the OpenCL device enqueue
3465 queue that should be used by
3466 the kernel by default is
3467 passed in the kernarg.
3469 "hidden_completion_action"
3470 A global address space pointer
3471 to help link enqueued kernels into
3472 the ancestor tree for determining
3473 when the parent kernel has finished.
3475 "hidden_multigrid_sync_arg"
3476 A global address space pointer for
3477 multi-grid synchronization is
3478 passed in the kernarg.
3480 ".value_type" string Unused and deprecated. This should no longer
3481 be emitted, but is accepted for compatibility.
3483 ".pointee_align" integer Alignment in bytes of pointee
3484 type for pointer type kernel
3485 argument. Must be a power
3486 of 2. Only present if
3488 "dynamic_shared_pointer".
3489 ".address_space" string Kernel argument address space
3490 qualifier. Only present if
3491 ".value_kind" is "global_buffer" or
3492 "dynamic_shared_pointer". Values
3504 Is "global_buffer" only "global"
3506 "dynamic_shared_pointer" always
3507 "local"? Can HCC allow "generic"?
3508 How can "private" or "region"
3511 ".access" string Kernel argument access
3512 qualifier. Only present if
3513 ".value_kind" is "image" or
3526 ".actual_access" string The actual memory accesses
3527 performed by the kernel on the
3528 kernel argument. Only present if
3529 ".value_kind" is "global_buffer",
3530 "image", or "pipe". This may be
3531 more restrictive than indicated
3532 by ".access" to reflect what the
3533 kernel actually does. If not
3534 present then the runtime must
3535 assume what is implied by
3536 ".access" and ".is_const". Values
3543 ".is_const" boolean Indicates if the kernel argument
3544 is const qualified. Only present
3548 ".is_restrict" boolean Indicates if the kernel argument
3549 is restrict qualified. Only
3550 present if ".value_kind" is
3553 ".is_volatile" boolean Indicates if the kernel argument
3554 is volatile qualified. Only
3555 present if ".value_kind" is
3558 ".is_pipe" boolean Indicates if the kernel argument
3559 is pipe qualified. Only present
3560 if ".value_kind" is "pipe".
3564 Can "global_buffer" be pipe
3567 ====================== ============== ========= ================================
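The ".offset" rule (each offset is a multiple of the alignment required by the argument) can be illustrated with a simple layout pass. Packing arguments sequentially is an assumption of this sketch, not a requirement stated in the table, and the function names are hypothetical.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def assign_offsets(args):
    """Return an ".offset" for each (size, alignment) pair, in order."""
    offsets = []
    cursor = 0
    for size, alignment in args:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets


# Sizes and alignments in bytes for four illustrative arguments.
print(assign_offsets([(4, 4), (8, 8), (1, 1), (4, 4)]))
```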
3569 .. _amdgpu-amdhsa-code-object-metadata-v4:
3571 Code Object V4 Metadata
3572 +++++++++++++++++++++++
3574 Code object V4 metadata is the same as
3575 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3576 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3578 .. table:: AMDHSA Code Object V4 Metadata Map Changes
3579 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3581 ================= ============== ========= =======================================
3582 String Key Value Type Required? Description
3583 ================= ============== ========= =======================================
3584 "amdhsa.version" sequence of Required - The first integer is the major
3585 2 integers version. Currently 1.
3586 - The second integer is the minor
3587 version. Currently 1.
3588 "amdhsa.target" string Required The target name of the code using the syntax:
3592 <target-triple> [ "-" <target-id> ]
3594 A canonical target ID must be
3595 used. See :ref:`amdgpu-target-triples`
3596 and :ref:`amdgpu-target-id`.
3597 ================= ============== ========= =======================================
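An "amdhsa.target" value can be separated into its two parts with a bounded split. The sketch assumes the target triple always has the four fields ``<Architecture>-<Vendor>-<OS>-<Environment>`` (the environment may be empty); the function name and the sample strings are illustrative.

```python
def split_target(target: str):
    """Return (target_triple, target_id or None) for an amdhsa.target value."""
    parts = target.split("-", 4)  # at most 4 triple fields plus the remainder
    if len(parts) < 4:
        raise ValueError("expected a four-field target triple")
    triple = "-".join(parts[:4])
    target_id = parts[4] if len(parts) == 5 else None
    return triple, target_id


print(split_target("amdgcn-amd-amdhsa--gfx908"))
```

Limiting the split to four separators keeps any later hyphens inside the target ID intact.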
3599 .. _amdgpu-amdhsa-code-object-metadata-v5:
3601 Code Object V5 Metadata
3602 +++++++++++++++++++++++
3605 Code object V5 is not the default code object version emitted by this version
3609 Code object V5 metadata is the same as
3610 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3611 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
3612 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
3613 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
.. table:: AMDHSA Code Object V5 Metadata Map Changes
   :name: amdgpu-amdhsa-code-object-metadata-map-table-v5

   ================= ============== ========= =======================================
   String Key        Value Type     Required? Description
   ================= ============== ========= =======================================
   "amdhsa.version"  sequence of    Required  - The first integer is the major
                     2 integers                 version. Currently 1.
                                              - The second integer is the minor
                                                version. Currently 2.
   ================= ============== ========= =======================================
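Taken together, the V3, V4 and V5 tables give "amdhsa.version" values of 1.0, 1.1 and 1.2 respectively. A reader of the metadata could use that to identify the code object version, as in this sketch; the names are hypothetical and the metadata is assumed to be already decoded into a dictionary.

```python
CODE_OBJECT_BY_METADATA_VERSION = {
    (1, 0): "V3",
    (1, 1): "V4",
    (1, 2): "V5",
}


def code_object_version(metadata: dict) -> str:
    """Map the "amdhsa.version" pair to a code object version name."""
    major, minor = metadata["amdhsa.version"]
    return CODE_OBJECT_BY_METADATA_VERSION[(major, minor)]
```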
.. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
   :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5

   ============================= ============= ========== =======================================
   String Key                    Value Type    Required?  Description
   ============================= ============= ========== =======================================
   ".uses_dynamic_stack"         boolean                  Indicates if the generated machine code
                                                          is using a dynamically sized stack.
   ".workgroup_processor_mode"   boolean                  (GFX10+) Controls ENABLE_WGP_MODE in
                                                          :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
   ============================= ============= ========== =======================================
.. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
   :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table

   =========================== ============== ========= ==============================
   String Key                  Value Type     Required? Description
   =========================== ============== ========= ==============================
   ".uniform_work_group_size"  integer                  Indicates if the kernel
                                                        requires that each dimension
                                                        of global size is a multiple
                                                        of corresponding dimension of
                                                        work-group size. Value of 1
                                                        implies true and value of 0
                                                        implies false. Metadata is
                                                        only emitted when value is 1.
   =========================== ============== ========= ==============================
3663 .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3664 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3666 ====================== ============== ========= ================================
3667 String Key Value Type Required? Description
3668 ====================== ============== ========= ================================
3669 ".value_kind" string Required Kernel argument kind that
3670 specifies how to set up the
3671 corresponding argument.
3673 the same as code object V3 metadata
3674 (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
3675 with the following additions:
3677 "hidden_block_count_x"
3678 The grid dispatch work-group count for the X dimension
3679 is passed in the kernarg. Some languages, such as OpenCL,
3680 support a last work-group in each dimension being partial.
3681 This count only includes the non-partial work-group count.
3682 This is not the same as the value in the AQL dispatch packet,
3683 which has the grid size in work-items.
3685 "hidden_block_count_y"
3686 The grid dispatch work-group count for the Y dimension
3687 is passed in the kernarg. Some languages, such as OpenCL,
3688 support a last work-group in each dimension being partial.
3689 This count only includes the non-partial work-group count.
3690 This is not the same as the value in the AQL dispatch packet,
3691 which has the grid size in work-items. If the grid dimensionality
3692 is 1, then this must be 1.
3694 "hidden_block_count_z"
3695 The grid dispatch work-group count for the Z dimension
3696 is passed in the kernarg. Some languages, such as OpenCL,
3697 support a last work-group in each dimension being partial.
3698 This count only includes the non-partial work-group count.
3699 This is not the same as the value in the AQL dispatch packet,
3700 which has the grid size in work-items. If the grid dimensionality
3701 is 1 or 2, then this must be 1.
3703 "hidden_group_size_x"
3704 The grid dispatch work-group size for the X dimension is
3705 passed in the kernarg. This size only applies to the
3706 non-partial work-groups. This is the same value as the AQL
3707 dispatch packet work-group size.
3709 "hidden_group_size_y"
3710 The grid dispatch work-group size for the Y dimension is
3711 passed in the kernarg. This size only applies to the
3712 non-partial work-groups. This is the same value as the AQL
3713 dispatch packet work-group size. If the grid dimensionality
3714 is 1, then this must be 1.
3716 "hidden_group_size_z"
3717 The grid dispatch work-group size for the Z dimension is
3718 passed in the kernarg. This size only applies to the
3719 non-partial work-groups. This is the same value as the AQL
3720 dispatch packet work-group size. If the grid dimensionality
3721 is 1 or 2, then this must be 1.
3723 "hidden_remainder_x"
3724 The grid dispatch work-group size of the partial work-group
3725 of the X dimension, if it exists. Must be zero if a partial
3726 work-group does not exist in the X dimension.
3728 "hidden_remainder_y"
3729 The grid dispatch work-group size of the partial work-group
3730 of the Y dimension, if it exists. Must be zero if a partial
3731 work-group does not exist in the Y dimension.
3733 "hidden_remainder_z"
3734 The grid dispatch work-group size of the partial work-group
3735 of the Z dimension, if it exists. Must be zero if a partial
3736 work-group does not exist in the Z dimension.
3739 The grid dispatch dimensionality. This is the same value
3740 as the AQL dispatch packet dimensionality. Must be a value
3744 A global address space pointer to an initialized memory
3745 buffer that conforms to the requirements of the malloc/free
3746 device library V1 version implementation.
3748 "hidden_private_base"
3749 The high 32 bits of the flat addressing private aperture base.
3750 Only used by GFX8 to allow conversion between private segment
3751 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3753 "hidden_shared_base"
3754 The high 32 bits of the flat addressing shared aperture base.
3755 Only used by GFX8 to allow conversion between shared segment
3756 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3759 A global memory address space pointer to the ROCm runtime
3760 ``struct amd_queue_t`` structure for the HSA queue of the
3761 associated dispatch AQL packet. It is only required for pre-GFX9
3762 devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
3764 ====================== ============== ========= ================================
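The relationship between the hidden block count, group size and remainder arguments described in the table can be sketched as follows. This is a minimal illustration, not runtime code: the function name ``hidden_dispatch_args`` and the dictionary layout are hypothetical, and the grid size is assumed to be given in work-items, as in the AQL dispatch packet.

```python
def hidden_dispatch_args(grid_size, group_size):
    """Derive per-dimension hidden argument values for one dispatch.

    grid_size:  grid size in work-items for X, Y, Z (unused dimensions are 1).
    group_size: work-group size in work-items for X, Y, Z (unused dimensions are 1).
    """
    args = {}
    for dim, grid, group in zip("xyz", grid_size, group_size):
        # Only complete (non-partial) work-groups are counted.
        args[f"hidden_block_count_{dim}"] = grid // group
        # Same value as the AQL dispatch packet work-group size.
        args[f"hidden_group_size_{dim}"] = group
        # Size of the trailing partial work-group, or 0 if there is none.
        args[f"hidden_remainder_{dim}"] = grid % group
    return args


# A 1-D dispatch of 1000 work-items in work-groups of 256.
args = hidden_dispatch_args((1000, 1, 1), (256, 1, 1))
```

With these inputs the X dimension has three complete work-groups and a partial work-group of 232 work-items, while the unused Y and Z dimensions report a count and size of 1 and a remainder of 0.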
3771 The HSA architected queuing language (AQL) defines a user space memory interface
3772 that can be used to control the dispatch of kernels, in an agent independent
3773 way. An agent can have zero or more AQL queues created for it using an HSA
3774 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3775 are 64 bytes) can be placed. See the *HSA Platform System Architecture
3776 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3778 The packet processor of a kernel agent is responsible for detecting and
3779 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3780 packet processor is implemented by the hardware command processor (CP),
3781 asynchronous dispatch controller (ADC) and shader processor input controller
3784 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3785 the kernel mode driver to initialize and register the AQL queue with CP.
3787 To dispatch a kernel the following actions are performed. This can occur in the
3788 CPU host program, or from an HSA kernel executing on a GPU.
3790 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3791 executed is obtained.
3792 2. A pointer to the kernel descriptor (see
3793 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3794 It must be for a kernel that is contained in a code object that was loaded
3795 by an HSA compatible runtime on the kernel agent with which the AQL queue is
3797 3. Space is allocated for the kernel arguments using the HSA compatible runtime
3798 allocator for a memory region with the kernarg property for the kernel agent
3799 that will execute the kernel. It must be at least 16-byte aligned.
3800 4. Kernel argument values are assigned to the kernel argument memory
3801 allocation. The layout is defined in the *HSA Programmer's Language
3802 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3803 kernel argument memory in the same way constant memory is accessed. (Note
3804 that the HSA specification allows an implementation to copy the kernel
3805 argument contents to another location that is accessed by the kernel.)
3806 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3807 runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3808 for the packet. The packet must be set up, and the final write must use an
3809 atomic store release to set the packet kind to ensure the packet contents are
3810 visible to the kernel agent. AQL defines a doorbell signal mechanism to
3811 notify the kernel agent that the AQL queue has been updated. These rules, and
3812 the layout of the AQL queue and kernel dispatch packet are defined in the *HSA
3813 System Architecture Specification* [HSA]_.
3814 6. A kernel dispatch packet includes information about the actual dispatch,
3815 such as grid and work-group size, together with information from the code
3816 object about the kernel, such as segment sizes. The HSA compatible runtime
3817 queries on the kernel symbol can be used to obtain the code object values
3818 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
3819 7. CP executes micro-code and is responsible for detecting and setting up the
3820 GPU to execute the wavefronts of a kernel dispatch.
3821 8. CP ensures that when a wavefront starts executing the kernel machine
3822 code, the scalar general purpose registers (SGPR) and vector general purpose
3823 registers (VGPR) are set up as required by the machine code. The required
3824 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3825 register state is defined in
3826 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
3827 9. The prolog of the kernel machine code (see
3828 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3829 before continuing executing the machine code that corresponds to the kernel.
3830 10. When the kernel dispatch has completed execution, CP signals the completion
3831 signal specified in the kernel dispatch packet if not 0.
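The ordering requirement in step 5 (reserve a slot, fill in the packet, set the packet kind last, then ring the doorbell) can be illustrated with a toy model. This is only a sketch: ``ToyAqlQueue``, its fields and the packet kind constants are hypothetical stand-ins, a lock stands in for the 64-bit atomic operations, and the real queue and packet layouts are those defined by the HSA specification and the runtime.

```python
import threading

# Illustrative packet kinds for this toy model only.
INVALID, KERNEL_DISPATCH = 1, 2


class ToyAqlQueue:
    """Toy model of an AQL ring buffer; not the real queue layout."""

    def __init__(self, size):
        self.size = size                          # number of packet slots
        self.slots = [{"kind": INVALID} for _ in range(size)]
        self.write_index = 0
        self.doorbell = -1                        # last index signalled to the agent
        self._lock = threading.Lock()             # stands in for 64-bit atomics

    def reserve(self):
        # Models the atomic add on the write index that reserves a slot.
        with self._lock:
            index = self.write_index
            self.write_index += 1
            return index

    def publish(self, index, body):
        slot = self.slots[index % self.size]
        slot.update(body)                         # 1. fill in the packet contents
        slot["kind"] = KERNEL_DISPATCH            # 2. set the kind last (a store release in real code)
        self.doorbell = index                     # 3. notify the kernel agent


queue = ToyAqlQueue(size=4)
idx = queue.reserve()
queue.publish(idx, {"grid_size": (1024, 1, 1),
                    "workgroup_size": (256, 1, 1),
                    "kernel_object": 0x1000,      # hypothetical kernel descriptor address
                    "kernarg_address": 0x2000})   # hypothetical kernarg allocation
```

The point of the model is the order inside ``publish``: the packet processor may read a slot as soon as its kind is no longer invalid, so everything else must already be in place.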
3833 .. _amdgpu-amdhsa-memory-spaces:
3838 The memory space properties are:
3840 .. table:: AMDHSA Memory Spaces
3841 :name: amdgpu-amdhsa-memory-spaces-table
3843 ================= =========== ======== ======= ==================
3844 Memory Space Name HSA Segment Hardware Address NULL Value
3846 ================= =========== ======== ======= ==================
3847 Private private scratch 32 0x00000000
3848 Local group LDS 32 0xFFFFFFFF
3849 Global global global 64 0x0000000000000000
3850 Constant constant *same as 64 0x0000000000000000
3852 Generic flat flat 64 0x0000000000000000
3853 Region N/A GDS 32 *not implemented
3855 ================= =========== ======== ======= ==================
3857 The global and constant memory spaces both use global virtual addresses, which
3858 are the same virtual address space used by the CPU. However, some virtual
3859 addresses may only be accessible to the CPU, some only accessible by the GPU,
3862 Using the constant memory space indicates that the data will not change during
3863 the execution of the kernel. This allows scalar read instructions to be
3864 used. The vector and scalar L1 caches are invalidated of volatile data before
3865 each kernel dispatch execution to allow constant memory to change values between
3868 The local memory space uses the hardware Local Data Store (LDS) which is
3869 automatically allocated when the hardware creates work-groups of wavefronts, and
3870 freed when all the wavefronts of a work-group have terminated. The data store
3871 (DS) instructions can be used to access it.
3873 The private memory space uses the hardware scratch memory support. If the kernel
3874 uses scratch, then the hardware allocates memory that is accessed using
3875 wavefront lane dword (4 byte) interleaving. The mapping used from private
3876 address to physical address is:
3878 ``wavefront-scratch-base +
3879 (private-address * wavefront-size * 4) +
3880 (wavefront-lane-id * 4)``
3882 There are different ways that the wavefront scratch base address is determined
3883 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3884 memory can be accessed in an interleaved manner using buffer instructions with
3885 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3886 instructions, or by flat instructions. If each lane of a wavefront accesses the
3887 same private address, the interleaving results in adjacent dwords being accessed
3888 and hence requires fewer cache lines to be fetched. Multi-dword access is not
3889 supported except by flat and scratch instructions in GFX9-GFX11.
3891 The generic address space uses the hardware flat address support available in
3892 GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
3893 local apertures), that are outside the range of addressable global memory, to
3894 map from a flat address to a private or local address.
3896 FLAT instructions can take a flat address and access global, private (scratch)
3897 and group (LDS) memory depending on whether the address is within one of the
3898 aperture ranges. Flat access to scratch requires hardware aperture setup and
3899 setup in the kernel prologue (see
3900 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3901 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3902 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3904 To convert between a segment address and a flat address the base addresses of
3905 the apertures can be used. For GFX7-GFX8 these are available in the
3906 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
3907 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3908 GFX9-GFX11 the aperture base addresses are directly available as inline constant
3909 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3910 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3911 which makes it easier to convert from flat to segment or segment to flat.
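Because the aperture size is 2^32 bytes and the base is 2^32 aligned, the conversion reduces to operations on the upper and lower 32 bits of the address. The following sketch assumes 64-bit address mode; the helper names and the aperture base value are hypothetical, and the differing NULL values of the memory spaces are not handled.

```python
APERTURE_SIZE = 1 << 32


def in_aperture(aperture_base, flat_address):
    # With a 2^32-aligned base, membership is a compare of the high 32 bits.
    return (flat_address >> 32) == (aperture_base >> 32)


def segment_to_flat(aperture_base, segment_address):
    if aperture_base % APERTURE_SIZE:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    if not in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)


PRIVATE_BASE = 0x2000 << 32  # hypothetical private aperture base
flat = segment_to_flat(PRIVATE_BASE, 0x1234)
segment = flat_to_segment(PRIVATE_BASE, flat)
```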
3916 Image and sample handles created by an HSA compatible runtime (see
3917 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3918 object respectively. In order to support the HSA ``query_sampler`` operations
3919 two extra dwords are used to store the HSA BRIG enumeration values for the
3920 queries that are not trivially deducible from the S# representation.
3925 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3926 are 64-bit addresses of a structure allocated in memory accessible from both the
3927 CPU and GPU. The structure is defined by the runtime and subject to change
3928 between releases. For example, see [AMD-ROCm-github]_.
3930 .. _amdgpu-amdhsa-hsa-aql-queue:
3935 The HSA AQL queue structure is defined by an HSA compatible runtime (see
3936 :ref:`amdgpu-os`) and subject to change between releases. For example, see
3937 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
3938 certain language features such as the flat address aperture bases. It also
3939 contains fields used by CP such as managing the allocation of scratch memory.
3941 .. _amdgpu-amdhsa-kernel-descriptor:
3946 A kernel descriptor consists of the information needed by CP to initiate the
3947 execution of a kernel, including the entry point address of the machine code
3948 that implements the kernel.
3950 Code Object V3 Kernel Descriptor
3951 ++++++++++++++++++++++++++++++++
3953 CP microcode requires the kernel descriptor to be allocated on 64-byte
3956 The fields used by CP for code objects before V3 also match those specified in
3957 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3959 .. table:: Code Object V3 Kernel Descriptor
3960 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3962 ======= ======= =============================== ============================
3963 Bits Size Field Name Description
3964 ======= ======= =============================== ============================
3965 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
3966 address space memory
3967 required for a work-group
3968 in bytes. This does not
3969 include any dynamically
3970 allocated local address
3971 space memory that may be
3972 added when the kernel is
3974 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
3975 private address space
3976 memory required for a
3977 work-item in bytes. When
3978 this cannot be predicted,
3979 code object v4 and older
3980 sets this value to be
3981 higher than the minimum
3983 95:64 4 bytes KERNARG_SIZE The size of the kernarg
3984 memory pointed to by the
3985 AQL dispatch packet. The
3986 kernarg memory is used to
3987 pass arguments to the
3990 * If the kernarg pointer in
3991 the dispatch packet is NULL
3992 then there are no kernel
3994 * If the kernarg pointer in
3995 the dispatch packet is
3996 not NULL and this value
3997 is 0 then the kernarg
4000 * If the kernarg pointer in
4001 the dispatch packet is
4002 not NULL and this value
4003 is not 0 then the value
4004 specifies the kernarg
4005 memory size in bytes. It
4006 is recommended to provide
4007 a value as it may be used
4008 by CP to optimize making
4010 visible to the kernel
4013 127:96 4 bytes Reserved, must be 0.
4014 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
4017 descriptor to kernel's
4018 entry point instruction
4019 which must be 256 byte
4021 351:272 20 Reserved, must be 0.
4023 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
4024 Reserved, must be 0.
4027 program settings used by
4029 ``COMPUTE_PGM_RSRC3``
4032 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
4035 program settings used by
4037 ``COMPUTE_PGM_RSRC3``
4040 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
4041 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
4042 program settings used by
4044 ``COMPUTE_PGM_RSRC1``
4047 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
4048 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
4049 program settings used by
4051 ``COMPUTE_PGM_RSRC2``
4054 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
4055 454:448 7 bits *See separate bits below.* Enable the setup of the
4056 SGPR user data registers
4058 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4060 The total number of SGPR
4062 requested must not exceed
4063 16 and must match the value in
4064 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
4065 Any requests beyond 16
4067 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
4069 :ref:`amdgpu-processor-table`
4070 specifies *Architected flat
4071 scratch* then not supported
4073 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
4074 >450 1 bit ENABLE_SGPR_QUEUE_PTR
4075 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
4076 >452 1 bit ENABLE_SGPR_DISPATCH_ID
4077 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
4079 :ref:`amdgpu-processor-table`
4080 specifies *Architected flat
4081 scratch* then not supported
4083 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
4085 457:455 3 bits Reserved, must be 0.
4086 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
4087 Reserved, must be 0.
4090 wavefront size 64 mode.
4092 native wavefront size
4094 459 1 bit USES_DYNAMIC_STACK Indicates if the generated
4095 machine code is using a
4096 dynamically sized stack.
4097 This is only set in code
4098 object v5 and later.
4099 463:460 4 bits Reserved, must be 0.
4100 464 1 bit RESERVED_464 Deprecated, must be 0.
4101 467:465 3 bits Reserved, must be 0.
4102 468 1 bit RESERVED_468 Deprecated, must be 0.
4103 471:469 3 bits Reserved, must be 0.
4104 511:472 5 bytes Reserved, must be 0.
4105 512 **Total size 64 bytes.**
4106 ======= ====================================================================
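As an illustration of the layout in the table, the 64-byte descriptor can be unpacked with Python's ``struct`` module. This is a sketch under stated assumptions: the function and dictionary key names are hypothetical, the span of bits 351:192 is treated as opaque, the entry offset is read as a signed value, and only property bits whose names appear in full in the table are decoded.

```python
import struct

# Little-endian, no padding: three size fields, one reserved dword, a 64-bit
# entry offset, 20 opaque bytes, RSRC3, RSRC1, RSRC2, 16 property bits and
# 6 reserved bytes.
KD_FORMAT = "<IIIIq20sIIIH6s"
assert struct.calcsize(KD_FORMAT) == 64


def parse_kernel_descriptor(raw):
    if len(raw) != 64:
        raise ValueError("a kernel descriptor is exactly 64 bytes")
    (group_size, private_size, kernarg_size, _reserved0, entry_offset,
     _opaque, rsrc3, rsrc1, rsrc2, properties, _reserved1) = struct.unpack(
        KD_FORMAT, raw)
    return {
        "group_segment_fixed_size": group_size,
        "private_segment_fixed_size": private_size,
        "kernarg_size": kernarg_size,
        "kernel_code_entry_byte_offset": entry_offset,
        "compute_pgm_rsrc3": rsrc3,
        "compute_pgm_rsrc1": rsrc1,
        "compute_pgm_rsrc2": rsrc2,
        # Property bits are numbered relative to bit 448 of the descriptor.
        "enable_sgpr_dispatch_ptr": bool(properties & (1 << 1)),
        "enable_sgpr_queue_ptr": bool(properties & (1 << 2)),
        "enable_sgpr_kernarg_segment_ptr": bool(properties & (1 << 3)),
        "enable_sgpr_dispatch_id": bool(properties & (1 << 4)),
        "enable_wavefront_size32": bool(properties & (1 << 10)),
        "uses_dynamic_stack": bool(properties & (1 << 11)),
    }


# Build a made-up descriptor and read it back.
raw = struct.pack(KD_FORMAT, 1024, 16, 40, 0, 256, bytes(20),
                  0, 0, 0, (1 << 3) | (1 << 10), bytes(6))
kd = parse_kernel_descriptor(raw)
```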
4110 .. table:: compute_pgm_rsrc1 for GFX6-GFX11
4111 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table
4113 ======= ======= =============================== ===========================================================================
4114 Bits Size Field Name Description
4115 ======= ======= =============================== ===========================================================================
4116 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
4117 blocks used by each work-item;
4118 granularity is device
4123 - max(0, ceil(vgprs_used / 4) - 1)
4126 - vgprs_used = align(arch_vgprs, 4)
4128 - max(0, ceil(vgprs_used / 8) - 1)
4129 GFX10-GFX11 (wavefront size 64)
4131 - max(0, ceil(vgprs_used / 4) - 1)
4132 GFX10-GFX11 (wavefront size 32)
4134 - max(0, ceil(vgprs_used / 8) - 1)
4136 Where vgprs_used is defined
4137 as the highest VGPR number
4138 explicitly referenced plus
4141 Used by CP to set up
4142 ``COMPUTE_PGM_RSRC1.VGPRS``.
4145 :ref:`amdgpu-assembler`
4147 automatically for the
4148 selected processor from
4149 values provided to the
4150 `.amdhsa_kernel` directive
4152 `.amdhsa_next_free_vgpr`
4153 nested directive (see
4154 :ref:`amdhsa-kernel-directives-table`).
4155 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4156 blocks used by a wavefront;
4157 granularity is device
4162 - max(0, ceil(sgprs_used / 8) - 1)
4165 - 2 * max(0, ceil(sgprs_used / 16) - 1)
4167 Reserved, must be 0.
4172 defined as the highest
4173 SGPR number explicitly
4174 referenced plus one, plus
4175 a target specific number
4176 of additional special
4178 FLAT_SCRATCH (GFX7+) and
4179 XNACK_MASK (GFX8+), and
4182 limitations. It does not
4183 include the 16 SGPRs added
4184 if a trap handler is
4188 limitations and special
4189 SGPR layout are defined in
4191 documentation, which can
4193 :ref:`amdgpu-processors`
4196 Used by CP to set up
4197 ``COMPUTE_PGM_RSRC1.SGPRS``.
4200 :ref:`amdgpu-assembler`
4202 automatically for the
4203 selected processor from
4204 values provided to the
4205 `.amdhsa_kernel` directive
4207 `.amdhsa_next_free_sgpr`
4208 and `.amdhsa_reserve_*`
4209 nested directives (see
4210 :ref:`amdhsa-kernel-directives-table`).
4211 11:10 2 bits PRIORITY Must be 0.
4213 Start executing wavefront
4214 at the specified priority.
4216 CP is responsible for
4218 ``COMPUTE_PGM_RSRC1.PRIORITY``.
4219 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
4220 with specified rounding
4223 precision floating point
4226 Floating point rounding
4227 mode values are defined in
4228 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4230 Used by CP to set up
4231 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4232 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
4233 with specified rounding
4234 mode for half/double (16
4235 and 64-bit)
4236 precision floating point
4239 Floating point rounding
4240 mode values are defined in
4241 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4243 Used by CP to set up
4244 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4245 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
4246 with specified denorm mode
4249 precision floating point
4252 Floating point denorm mode
4253 values are defined in
4254 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4256 Used by CP to set up
4257 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4258 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
4259 with specified denorm mode
4261 and 64-bit)
4262 precision floating point
4265 Floating point denorm mode
4266 values are defined in
4267 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4269 Used by CP to set up
4270 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4271 20 1 bit PRIV Must be 0.
4273 Start executing wavefront
4274 in privilege trap handler
4277 CP is responsible for
4279 ``COMPUTE_PGM_RSRC1.PRIV``.
4280 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
4281 with DX10 clamp mode
4282 enabled. Used by the vector
4283 ALU to force DX10 style
4284 treatment of NaNs (when
4285 set, clamp NaN to zero,
4289 Used by CP to set up
4290 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4291 22 1 bit DEBUG_MODE Must be 0.
4293 Start executing wavefront
4294 in single step mode.
4296 CP is responsible for
4298 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4299 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
4301 enabled. Floating point
4302 opcodes that support
4303 exception flag gathering
4304 will quiet and propagate
4305 signaling-NaN inputs per
4306 IEEE 754-2008. Min_dx10 and
4307 max_dx10 become IEEE
4308 754-2008 compliant due to
4309 signaling-NaN propagation
4312 Used by CP to set up
4313 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4314 24 1 bit BULKY Must be 0.
4316 Only one work-group allowed
4317 to execute on a compute
4320 CP is responsible for
4322 ``COMPUTE_PGM_RSRC1.BULKY``.
4323 25 1 bit CDBG_USER Must be 0.
4325 Flag that can be used to
4326 control debugging code.
4328 CP is responsible for
4330 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4331 26 1 bit FP16_OVFL GFX6-GFX8
4332 Reserved, must be 0.
4334 Wavefront starts execution
4335 with specified fp16 overflow
4338 - If 0, fp16 overflow generates
4340 - If 1, fp16 overflow that is the
4341 result of an +/-INF input value
4342 or divide by 0 produces a +/-INF,
4343 otherwise clamps computed
4344 overflow to +/-MAX_FP16 as
4347 Used by CP to set up
4348 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4349 28:27 2 bits Reserved, must be 0.
4350 29 1 bit WGP_MODE GFX6-GFX9
4351 Reserved, must be 0.
4353 - If 0 execute work-groups in
4354 CU wavefront execution mode.
4355 - If 1 execute work-groups
4356 in WGP wavefront execution mode.
4358 See :ref:`amdgpu-amdhsa-memory-model`.
4360 Used by CP to set up
4361 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4362 30 1 bit MEM_ORDERED GFX6-GFX9
4363 Reserved, must be 0.
4365 Controls the behavior of the
4366 s_waitcnt's vmcnt and vscnt
4369 - If 0 vmcnt reports completion
4370 of load and atomic with return
4371 out of order with sample
4372 instructions, and the vscnt
4373 reports the completion of
4374 store and atomic without
4376 - If 1 vmcnt reports completion
4377 of load, atomic with return
4378 and sample instructions in
4379 order, and the vscnt reports
4380 the completion of store and
4381 atomic without return in order.
4383 Used by CP to set up
4384 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4385 31 1 bit FWD_PROGRESS GFX6-GFX9
4386 Reserved, must be 0.
4388 - If 0 execute SIMD wavefronts
4389 using oldest first policy.
4390 - If 1 execute SIMD wavefronts to
4391 ensure wavefronts will make some
4394 Used by CP to set up
4395 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4396 32 **Total size 4 bytes**
4397 ======= ===================================================================================================================
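The granulated VGPR count is a block count minus one. A minimal sketch of the GFX10-GFX11 formulas listed in the table follows; the function name is hypothetical, and ``vgprs_used`` is taken to be the highest explicitly referenced VGPR number plus one, as the table defines it.

```python
def granulated_vgpr_count_gfx10_gfx11(vgprs_used, wavefront_size):
    """max(0, ceil(vgprs_used / granule) - 1) for GFX10-GFX11."""
    # Granule per the table: 4 for wavefront size 64, 8 for wavefront size 32.
    granule = {64: 4, 32: 8}[wavefront_size]
    blocks = -(-vgprs_used // granule)  # integer ceiling division
    return max(0, blocks - 1)


wave64 = granulated_vgpr_count_gfx10_gfx11(37, 64)
wave32 = granulated_vgpr_count_gfx10_gfx11(37, 32)
```

In practice the assembler computes this value from ``.amdhsa_next_free_vgpr``, so a calculation like this is mainly useful when inspecting or generating descriptors by hand.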
4401 .. table:: compute_pgm_rsrc2 for GFX6-GFX11
4402 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table
4404 ======= ======= =============================== ===========================================================================
4405 Bits Size Field Name Description
4406 ======= ======= =============================== ===========================================================================
4407 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
4409 * If the *Target Properties*
4411 :ref:`amdgpu-processor-table`
4414 scratch* then enable the
4416 wavefront scratch offset
4417 system register (see
4418 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4419 * If the *Target Properties*
4421 :ref:`amdgpu-processor-table`
4422 specifies *Architected
4423 flat scratch* then enable
4425 FLAT_SCRATCH register
4427 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4429 Used by CP to set up
4430 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4431 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
4433 registers requested. This
4434 number must be greater than
4435 or equal to the number of user
4436 data registers enabled.
4438 Used by CP to set up
4439 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4440 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
4443 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4444 which is set by the CP if
4445 the runtime has installed a
4447 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
4448 system SGPR register for
4449 the work-group id in the X
4451 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4453 Used by CP to set up
4454 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4455 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
4456 system SGPR register for
4457 the work-group id in the Y
4459 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4461 Used by CP to set up
4462 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4463 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
4464 system SGPR register for
4465 the work-group id in the Z
4467 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4469 Used by CP to set up
4470 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4471 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
4472 system SGPR register for
4473 work-group information (see
4474 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4476 Used by CP to set up
4477 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4478 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
4479 VGPR system registers used
4480 for the work-item ID.
4481 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4484 Used by CP to set up
4485 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4486 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
4488 Wavefront starts execution
4490 exceptions enabled which
4491 are generated when L1 has
4492 witnessed a thread access
4496 CP is responsible for
4497 filling in the address
4499 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4500 according to what the
4502 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
4504 Wavefront starts execution
4505 with memory violation
4506 exceptions
4507 enabled which are generated
4508 when a memory violation has
4509 occurred for this wavefront from
4511 (write-to-read-only-memory,
4512 mis-aligned atomic, LDS
4513 address out of range,
4514 illegal address, etc.).
4518 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4519 according to what the
4521 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
4523 CP uses the rounded value
4524 from the dispatch packet,
4525 not this value, as the
4526 dispatch may contain
4527 dynamically allocated group
4528 segment memory. CP writes
4530 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4532 Amount of group segment
4533 (LDS) to allocate for each
4534 work-group. Granularity is
4538 roundup(lds-size / (64 * 4))
4540 roundup(lds-size / (128 * 4))
4542 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
4543 _INVALID_OPERATION with specified exceptions
4546 Used by CP to set up
4547 ``COMPUTE_PGM_RSRC2.EXCP_EN``
4548 (set from bits 0..6).
4552 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
4553 _SOURCE input operands is a
4555 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
4556 _DIVISION_BY_ZERO Zero
4557 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
4559 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
4561 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
4563 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
4564 _ZERO (rcp_iflag_f32 instruction
4566 31 1 bit Reserved, must be 0.
4567 32 **Total size 4 bytes.**
4568 ======= ===================================================================================================================
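The bit positions in the table can be exercised with a small encode/decode helper. This is a sketch only: it covers a subset of the fields, and the helper names are hypothetical.

```python
# Field name -> (lowest bit, width in bits), taken from the table.
RSRC2_FIELDS = {
    "ENABLE_PRIVATE_SEGMENT": (0, 1),
    "USER_SGPR_COUNT": (1, 5),
    "ENABLE_TRAP_HANDLER": (6, 1),
    "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
    "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
    "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
    "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
    "ENABLE_VGPR_WORKITEM_ID": (11, 2),
    "GRANULATED_LDS_SIZE": (15, 9),
}


def encode_rsrc2(**values):
    word = 0
    for name, value in values.items():
        low, width = RSRC2_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word


def decode_rsrc2(word):
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in RSRC2_FIELDS.items()}


word = encode_rsrc2(ENABLE_PRIVATE_SEGMENT=1, USER_SGPR_COUNT=6,
                    ENABLE_SGPR_WORKGROUP_ID_X=1, ENABLE_VGPR_WORKITEM_ID=2)
fields = decode_rsrc2(word)
```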
4572 .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4573 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4575 ======= ======= =============================== ===========================================================================
4576 Bits Size Field Name Description
4577 ======= ======= =============================== ===========================================================================
4578 5:0 6 bits ACCUM_OFFSET Offset of a first AccVGPR in the unified register file. Granularity 4.
4579 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4580 63 - accum-offset = 256.
4581 15:6 10 Reserved, must be 0.
4583 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
4584 launched in the same CU.
4585 - If 1 the waves of a work-group can be
4586 launched in different CUs. The waves
4587 cannot use S_BARRIER or LDS.
4588 31:17 15 Reserved, must be 0.
4590 32 **Total size 4 bytes.**
4591 ======= ===================================================================================================================
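The ``ACCUM_OFFSET`` encoding maps offsets 4, 8, ..., 256 to field values 0 to 63. A minimal sketch of that mapping, together with the ``TG_SPLIT`` bit, is shown below; the function name is hypothetical.

```python
def encode_rsrc3_gfx90a(accum_offset, tg_split=False):
    """Pack ACCUM_OFFSET (bits 5:0) and TG_SPLIT (bit 16)."""
    if accum_offset % 4 or not 4 <= accum_offset <= 256:
        raise ValueError("accum_offset must be a multiple of 4 in 4..256")
    # Field value 0 means offset 4, value 1 means offset 8, and so on.
    return (accum_offset // 4 - 1) | (int(tg_split) << 16)


smallest = encode_rsrc3_gfx90a(4)
largest_split = encode_rsrc3_gfx90a(256, tg_split=True)
```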
4595 .. table:: compute_pgm_rsrc3 for GFX10-GFX11
4596 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
4598 ======= ======= =============================== ===========================================================================
4599 Bits Size Field Name Description
4600 ======= ======= =============================== ===========================================================================
4601 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For
4602 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4603 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4604 not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4605 9:4 6 bits INST_PREF_SIZE GFX10
4606 Reserved, must be 0.
4608 Number of instruction bytes to prefetch, starting at the kernel's entry
4609 point instruction, before wavefront starts execution. The value is 0..63
4610 with a granularity of 128 bytes.
4611 10 1 bit TRAP_ON_START GFX10
4612 Reserved, must be 0.
4616 If 1, wavefront starts execution by trapping into the trap handler.
4618 CP is responsible for filling in the trap on start bit in
4619 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
4621 11 1 bit TRAP_ON_END GFX10
4622 Reserved, must be 0.
4626 If 1, wavefront execution terminates by trapping into the trap handler.
4628 CP is responsible for filling in the trap on end bit in
4629 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
4630 30:12 19 bits Reserved, must be 0.
4631 31 1 bit IMAGE_OP GFX10
4632 Reserved, must be 0.
4634 If 1, the kernel execution contains image instructions. If executed as
4635 part of a graphics pipeline, image read instructions will stall waiting
4636 for any necessary ``WAIT_SYNC`` fence to be performed in order to
4637 indicate that earlier pipeline stages have completed writing to the
4640 Not used for compute kernels that are not part of a graphics pipeline and
4642 32 **Total size 4 bytes.**
4643 ======= ===================================================================================================================
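The ``SHARED_VGPR_COUNT`` constraint and the ``INST_PREF_SIZE`` granularity can be checked with a few lines of arithmetic. In this sketch the function names are hypothetical, ``rsrc1_vgprs`` stands for the granulated value stored in ``compute_pgm_rsrc1``, and clamping the prefetch size to the maximum field value of 63 is an assumption rather than something the table states.

```python
def shared_vgpr_count_is_valid(rsrc1_vgprs, shared_vgpr_count, wavefront_size):
    """Check the SHARED_VGPR_COUNT rules stated in the table."""
    if wavefront_size == 32:
        return shared_vgpr_count == 0
    if not 0 <= shared_vgpr_count <= 15:
        return False
    return (rsrc1_vgprs + 1) * 4 + shared_vgpr_count * 8 <= 256


def inst_pref_size(prefetch_bytes):
    """Number of 128-byte units covering prefetch_bytes, clamped to 63."""
    units = -(-prefetch_bytes // 128)  # integer ceiling division
    return min(units, 63)              # assumption: clamp to the field maximum


fits = shared_vgpr_count_is_valid(31, 15, 64)       # 128 + 120 = 248
too_many = shared_vgpr_count_is_valid(40, 15, 64)   # 164 + 120 = 284
```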
4647 .. table:: Floating Point Rounding Mode Enumeration Values
4648 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4650 ====================================== ===== ==============================
4651 Enumeration Name Value Description
4652 ====================================== ===== ==============================
4653 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
4654 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
4655 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
4656 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
4657 ====================================== ===== ==============================
4661 .. table:: Floating Point Denorm Mode Enumeration Values
4662 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4664 ====================================== ===== ====================================
4665 Enumeration Name Value Description
4666 ====================================== ===== ====================================
4667 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination Denorms
4668 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
4669 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
4670 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
4671 ====================================== ===== ====================================
4673 Denormal flushing is sign respecting, i.e., the behavior expected by
4674 ``"denormal-fp-math"="preserve-sign"``. The behavior is undefined with
4675 ``"denormal-fp-math"="positive-zero"``.
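The rounding and denorm mode enumerations occupy bits 19:12 of ``compute_pgm_rsrc1``. The sketch below expresses the two enumerations and packs the four fields at the bit positions given in the ``compute_pgm_rsrc1`` table; the class and function names are hypothetical.

```python
from enum import IntEnum


class FloatRoundMode(IntEnum):
    NEAR_EVEN = 0
    PLUS_INFINITY = 1
    MINUS_INFINITY = 2
    ZERO = 3


class FloatDenormMode(IntEnum):
    FLUSH_SRC_DST = 0
    FLUSH_DST = 1
    FLUSH_SRC = 2
    FLUSH_NONE = 3


def pack_float_mode(round_32, round_16_64, denorm_32, denorm_16_64):
    """Place the four 2-bit fields at bits 13:12, 15:14, 17:16 and 19:18."""
    return ((round_32 << 12) | (round_16_64 << 14)
            | (denorm_32 << 16) | (denorm_16_64 << 18))


# Round to nearest even everywhere, flush 32-bit denorms, keep 16/64-bit ones.
bits = pack_float_mode(FloatRoundMode.NEAR_EVEN, FloatRoundMode.NEAR_EVEN,
                       FloatDenormMode.FLUSH_SRC_DST, FloatDenormMode.FLUSH_NONE)
```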
4679 .. table:: System VGPR Work-Item ID Enumeration Values
4680 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4682 ======================================== ===== ============================
4683 Enumeration Name Value Description
4684 ======================================== ===== ============================
4685 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
4687 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
4689 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
4691 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
4692 ======================================== ===== ============================
4694 .. _amdgpu-amdhsa-initial-kernel-execution-state:
4696 Initial Kernel Execution State
4697 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4699 This section defines the register state that will be set up by the packet
4700 processor prior to the start of execution of every wavefront. This is limited by
4701 the constraints of the hardware controllers of CP/ADC/SPI.
4703 The order of the SGPR registers is defined, but the compiler can specify which
4704 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4705 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4706 for enabled registers are dense starting at SGPR0: the first enabled register is
4707 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4710 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4711 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4712 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4713 actually initialized. These are then immediately followed by the System SGPRs
4714 that are set up by ADC/SPI and can have different values for each wavefront of
4717 SGPR register initial state is defined in
4718 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4720 .. table:: SGPR Register Set Up Order
4721 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4723 ========== ========================== ====== ==============================
4724 SGPR Order Name Number Description
4725 (kernel descriptor enable of
4727 ========== ========================== ====== ==============================
4728 First Private Segment Buffer 4 See
4729 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4731 then Dispatch Ptr 2 64-bit address of AQL dispatch
4732 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
4734 then Queue Ptr 2 64-bit address of amd_queue_t
4735 (enable_sgpr_queue_ptr) object for AQL queue on which
4736 the dispatch packet was
4738 then Kernarg Segment Ptr 2 64-bit address of Kernarg
4739 (enable_sgpr_kernarg segment. This is directly
4740 _segment_ptr) copied from the
4741 kernarg_address in the kernel
4744 Having CP load it once avoids
4745 loading it at the beginning of
4747 then Dispatch Id 2 64-bit Dispatch ID of the
4748 (enable_sgpr_dispatch_id) dispatch packet being
4750 then Flat Scratch Init 2 See
4751 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4753 then Private Segment Size 1 The 32-bit byte size of a
4754 (enable_sgpr_private single work-item's memory
4755 _segment_size) allocation. This is the
4756 value from the kernel
4757 dispatch packet Private
4758 Segment Byte Size rounded up
4759 by CP to a multiple of
4762 Having CP load it once avoids
4763 loading it at the beginning of
4766 This is not used for
4767 GFX7-GFX8 since it is the same
4768 value as the second SGPR of
4769 Flat Scratch Init. However, it
4770 may be needed for GFX9-GFX11 which
4771 changes the meaning of the
4772 Flat Scratch Init value.
4773 then Work-Group Id X 1 32-bit work-group id in X
4774 (enable_sgpr_workgroup_id dimension of grid for
4776 then Work-Group Id Y 1 32-bit work-group id in Y
4777 (enable_sgpr_workgroup_id dimension of grid for
4779 then Work-Group Id Z 1 32-bit work-group id in Z
4780 (enable_sgpr_workgroup_id dimension of grid for
4782 then Work-Group Info 1 {first_wavefront, 14'b0000,
4783 (enable_sgpr_workgroup ordered_append_term[10:0],
4784 _info) threadgroup_size_in_wavefronts[5:0]}
4785 then Scratch Wavefront Offset 1 See
4786 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4787 _segment_wavefront_offset) and
4788 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4789 ========== ========================== ====== ==============================
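The dense numbering rule can be illustrated with a short sketch. It assumes only the order and register counts from the table above, identifies each entry by its *Name* column rather than by a kernel descriptor field, and does not model the limit of 16 initialized User SGPRs. The function name and data structure are hypothetical.

.. code-block:: python

   # (name, number of SGPRs), in the set up order given by the table.
   SGPR_SETUP_ORDER = [
       ("Private Segment Buffer", 4),
       ("Dispatch Ptr", 2),
       ("Queue Ptr", 2),
       ("Kernarg Segment Ptr", 2),
       ("Dispatch Id", 2),
       ("Flat Scratch Init", 2),
       ("Private Segment Size", 1),
       ("Work-Group Id X", 1),
       ("Work-Group Id Y", 1),
       ("Work-Group Id Z", 1),
       ("Work-Group Info", 1),
       ("Scratch Wavefront Offset", 1),
   ]


   def assign_initial_sgprs(enabled):
       """Map each enabled entry to its (first, last) SGPR number.

       Enabled registers are packed densely starting at SGPR0; disabled
       entries do not consume a register number.
       """
       assignment = {}
       next_sgpr = 0
       for name, count in SGPR_SETUP_ORDER:
           if name in enabled:
               assignment[name] = (next_sgpr, next_sgpr + count - 1)
               next_sgpr += count
       return assignment

With only the Private Segment Buffer, Kernarg Segment Ptr and Work-Group Id X enabled, the sketch places them in SGPR0-3, SGPR4-5 and SGPR6 respectively.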
4791 The order of the VGPR registers is defined, but the compiler can specify which
4792 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4793 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4794 for enabled registers are dense starting at VGPR0: the first enabled register is
4795 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4798 There are different methods used for the VGPR initial state:
4800 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4801 specifies otherwise, a separate VGPR register is used per work-item ID. The
4802 VGPR register initial state for this method is defined in
4803 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4804 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4805 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4806 for all work-item IDs. The register layout for this method is defined in
4807 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4809 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4810 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4812 ========== ========================== ====== ==============================
4813 VGPR Order Name Number Description
4814 (kernel descriptor enable of
4816 ========== ========================== ====== ==============================
4817 First Work-Item Id X 1 32-bit work-item id in X
4818 (Always initialized) dimension of work-group for
4820 then Work-Item Id Y 1 32-bit work-item id in Y
4821 (enable_vgpr_workitem_id dimension of work-group for
4822 > 0) wavefront lane.
4823 then Work-Item Id Z 1 32-bit work-item id in Z
4824 (enable_vgpr_workitem_id dimension of work-group for
4825 > 1) wavefront lane.
4826 ========== ========================== ====== ==============================
4830 .. table:: Register Layout for Packed Work-Item ID Method
4831 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4833 ======= ======= ================ =========================================
4834 Bits Size Field Name Description
4835 ======= ======= ================ =========================================
4836 0:9 10 bits Work-Item Id X Work-item id in X
4837 dimension of work-group for
4842 10:19 10 bits Work-Item Id Y Work-item id in Y
4843 dimension of work-group for
4846 Initialized if enable_vgpr_workitem_id >
4847 0, otherwise set to 0.
4848 20:29 10 bits Work-Item Id Z Work-item id in Z
4849 dimension of work-group for
4852 Initialized if enable_vgpr_workitem_id >
4853 1, otherwise set to 0.
4854 30:31 2 bits Reserved, set to 0.
4855 ======= ======= ================ =========================================
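As an illustration of the packed layout, the sketch below extracts the three 10-bit fields from a 32-bit VGPR0 value. It assumes only the bit positions in the table above; the function name is hypothetical.

.. code-block:: python

   WORK_ITEM_ID_MASK = 0x3FF  # each packed work-item id field is 10 bits wide


   def unpack_work_item_ids(vgpr0: int):
       """Return (x, y, z) work-item ids from a packed VGPR0 value."""
       x = vgpr0 & WORK_ITEM_ID_MASK           # bits 0:9
       y = (vgpr0 >> 10) & WORK_ITEM_ID_MASK   # bits 10:19
       z = (vgpr0 >> 20) & WORK_ITEM_ID_MASK   # bits 20:29
       return x, y, z

Fields that are not initialized read as 0, so a kernel that enables only the X dimension gets ``(x, 0, 0)`` from this helper.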
4857 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4859 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4861 2. Work-group Id registers X, Y, Z are set by ADC which supports any
4862 combination including none.
4863 3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
4864 its value cannot be included with the flat scratch init value, which is per
4865 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4866 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4868 5. Flat Scratch register pair initialization is described in
4869 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4871 The global segment can be accessed either using buffer instructions (GFX6 which
4872 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
4873 instructions (GFX9-GFX11).
4875 If buffer operations are used, then the compiler can generate a V# with the
4876 following properties:
4880 * ATC: 1 if IOMMU present (such as APU)
4882 * MTYPE set to support memory coherence that matches the runtime (such as CC for
4883 APU and NC for dGPU).
4885 .. _amdgpu-amdhsa-kernel-prolog:
4890 The compiler performs initialization in the kernel prologue depending on the
4891 target and information about things like stack usage in the kernel and called
4892 functions. Some of this initialization requires the compiler to request certain
4893 User and System SGPRs be present in the
4894 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4895 :ref:`amdgpu-amdhsa-kernel-descriptor`.
4897 .. _amdgpu-amdhsa-kernel-prolog-cfi:
4902 1. The CFI return address is undefined.
4904 2. The CFI CFA is defined using an expression which evaluates to a location
4905 description that comprises one memory location description for the
4906 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4908 .. _amdgpu-amdhsa-kernel-prolog-m0:
4914 The M0 register must be initialized with a value at least the total LDS size
4915 if the kernel may access LDS via DS or flat operations. Total LDS size is
4916 available in the dispatch packet. For M0, it is also possible to use the maximum
4917 possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
4920 The M0 register is not used for range checking LDS accesses and so does not
4921 need to be initialized in the prolog.
4923 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4928 If the kernel has function calls it must set up the ABI stack pointer described
4929 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4930 SGPR32 to the unswizzled scratch offset of the address past the last local
4933 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4938 If the kernel needs a frame pointer for the reasons defined in
4939 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4940 kernel prolog. If a frame pointer is not required then all uses of the frame
4941 pointer are replaced with immediate ``0`` offsets.
4943 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4948 There are different methods used for initializing flat scratch:
4950 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4951 specifies *Does not support generic address space*:
4953 Flat scratch is not supported and there is no flat scratch register pair.
4955 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4956 specifies *Offset flat scratch*:
4958 If the kernel or any function it calls may use flat operations to access
4959 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4960 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4961 Scratch Wavefront Offset SGPR registers (see
4962 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4964 1. The low word of Flat Scratch Init is the 32-bit byte offset from
4965 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4966 being managed by SPI for the queue executing the kernel dispatch. This is
4967 the same value used in the Scratch Segment Buffer V# base address.
4969 CP obtains this from the runtime. (The Scratch Segment Buffer base address
4970 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4972 The prolog must add the value of Scratch Wavefront Offset to get the
4973 wavefront's byte scratch backing memory offset from
4974 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4976 The Scratch Wavefront Offset must also be used as an offset with Private
4977 segment address when using the Scratch Segment Buffer.
4979 Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
4980 shifted by 8 before moving into FLAT_SCRATCH_HI.
4982 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4983 SGPRn is the highest numbered SGPR allocated to the wavefront).
4984 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4985 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4986 FLAT SCRATCH BASE in flat memory instructions that access the scratch
4988 2. The second word of Flat Scratch Init is the 32-bit byte size of a single
4989 work-item's scratch memory usage.
4991 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4992 checks that the value in the kernel dispatch packet Private Segment Byte
4993 Size is not larger and requests the runtime to increase the queue's scratch
4996 CP directly loads from the kernel dispatch packet Private Segment Byte Size
4997 field and rounds up to a multiple of DWORD. Having CP load it once avoids
4998 loading it at the beginning of every wavefront.
5000 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
5001 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
5002 in flat memory instructions.
5004 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5005 specifies *Absolute flat scratch*:
5007 If the kernel or any function it calls may use flat operations to access
5008 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
5009 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
5010 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
5011 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
5013 The Flat Scratch Init is the 64-bit address of the base of scratch backing
5014 memory being managed by SPI for the queue executing the kernel dispatch.
5016 CP obtains this from the runtime.
5018 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
5019 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
5020 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
5021 memory instructions.
5023 The Scratch Wavefront Offset must also be used as an offset with Private
5024 segment address when using the Scratch Segment Buffer (see
5025 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
5027 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
5028 specifies *Architected flat scratch*:
5030 If ENABLE_PRIVATE_SEGMENT is enabled in
5031 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH
5032 register pair will be initialized to the 64-bit address of the base of scratch
5033 backing memory being managed by SPI for the queue executing the kernel
5034 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
5035 flat scratch base in flat memory instructions.
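The arithmetic performed by the prolog for the *Offset flat scratch* and *Absolute flat scratch* methods can be sketched as follows. This is a model of the register values only, assuming 32-bit SGPR inputs as described above; it is not the code the backend emits, and the function names are hypothetical.

.. code-block:: python

   MASK32 = 0xFFFFFFFF


   def offset_flat_scratch(init_lo: int, init_hi: int, wave_offset: int):
       """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for offset flat scratch.

       init_lo is the queue's byte offset to scratch backing memory and
       init_hi is the per work-item scratch byte size.
       """
       wave_byte_offset = (init_lo + wave_offset) & MASK32
       flat_scratch_hi = wave_byte_offset >> 8  # stored in units of 256 bytes
       flat_scratch_lo = init_hi & MASK32       # used as the scratch size
       return flat_scratch_lo, flat_scratch_hi


   def absolute_flat_scratch(init_lo: int, init_hi: int, wave_offset: int):
       """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for absolute flat scratch.

       Flat Scratch Init is treated as a 64-bit base address to which the
       wavefront's scratch offset is added.
       """
       base = ((init_hi << 32) | init_lo) + wave_offset
       return base & MASK32, (base >> 32) & MASK32

Note that the 64-bit addition in the absolute method can carry from the low word into the high word, which is why the two halves cannot be computed independently.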
5037 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
5039 Private Segment Buffer
5040 ++++++++++++++++++++++
5042 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
5043 *Architected flat scratch* then a Private Segment Buffer is not supported.
5044 Instead the flat SCRATCH instructions are used.
5046 Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
5047 that are used as a V# to access scratch. CP uses the value provided by the
5048 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
5049 access the private memory space using a segment address. See
5050 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
5052 The scratch V# is a four-aligned SGPR and always selected for the kernel as
5055 - If it is known during instruction selection that there is stack usage,
5056 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
5057 optimizations are disabled (``-O0``), if stack objects already exist (for
5058 locals, etc.), or if there are any function calls.
5060 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5061 are reserved for the tentative scratch V#. These will be used if it is
5062 determined that spilling is needed.
5064 - If no use is made of the tentative scratch V#, then it is unreserved,
5065 and the register count is determined ignoring it.
5066 - If use is made of the tentative scratch V#, then its register numbers
5067 are shifted to the first four-aligned SGPR index after the highest one
5068 allocated by the register allocator, and all uses are updated. The
5069 register count includes them in the shifted location.
5070 - In either case, if the processor has the SGPR allocation bug, the
5071 tentative allocation is not shifted or unreserved in order to ensure
5072 the register count is higher to work around the bug.
5076 This approach of using a tentative scratch V# and shifting the register
5077 numbers if used avoids having to perform register allocation a second
5078 time if the tentative V# is eliminated. This is more efficient and
5079 avoids the problem that the second register allocation may perform
5080 spilling which will fail as there is no longer a scratch V#.
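The shifted location of a tentative scratch V# that turned out to be used can be computed with simple alignment arithmetic. The sketch below assumes only the rule stated above (the first four-aligned SGPR index after the highest one allocated); the function name is hypothetical.

.. code-block:: python

   def shifted_scratch_vsharp_base(highest_allocated_sgpr: int) -> int:
       """Return the first four-aligned SGPR index above the highest allocated."""
       first_free = highest_allocated_sgpr + 1
       # Round up to the next multiple of 4.
       return (first_free + 3) // 4 * 4

For instance, if the register allocator's highest SGPR is SGPR9, the scratch V# occupies SGPR12-15 and the register count covers those registers as well.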
5082 When the kernel prolog code is being emitted it is known whether the scratch V#
5083 described above is actually used. If it is, the prolog code must set it up by
5084 copying the Private Segment Buffer to the scratch V# registers and then adding
5085 the Private Segment Wavefront Offset to the queue base address in the V#. The
5086 result is a V# with a base address pointing to the beginning of the wavefront
5087 scratch backing memory.
5089 The Private Segment Buffer is always requested, but the Private Segment
5090 Wavefront Offset is only requested if it is used (see
5091 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5093 .. _amdgpu-amdhsa-memory-model:
5098 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5099 code (see :ref:`memmodel`).
5101 The AMDGPU backend supports the memory synchronization scopes specified in
5102 :ref:`amdgpu-memory-scopes`.
5104 The code sequences used to implement the memory model specify the order of
5105 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5106 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5107 to other memory instructions executed by the same thread. This allows them to be
5108 moved earlier or later which can allow them to be combined with other instances
5109 of the same instruction, or hoisted/sunk out of loops to improve performance.
5110 Only the instructions related to the memory model are given; additional
5111 ``s_waitcnt`` instructions are required to ensure registers are defined before
5112 being used. These may be able to be combined with the memory model ``s_waitcnt``
5113 instructions as described above.
5115 The AMDGPU backend supports the following memory models:
5117 HSA Memory Model [HSA]_
5118 The HSA memory model uses a single happens-before relation for all address
5119 spaces (see :ref:`amdgpu-address-spaces`).
5120 OpenCL Memory Model [OpenCL]_
5121 The OpenCL memory model has separate happens-before relations for the
5122 global and local address spaces. Only a fence specifying both global and
5123 local address space, and seq_cst instructions join the relationships. Since
5124 the LLVM ``fence`` instruction does not allow an address space to be
5125 specified the OpenCL fence has to conservatively assume both local and
5126 global address space was specified. However, optimizations can often be
5127 done to eliminate the additional ``s_waitcnt`` instructions when there are
5128 no intervening memory instructions which access the corresponding address
5129 space. The code sequences in the table indicate what can be omitted for the
5130 OpenCL memory model. The target triple environment is used to determine if the
5131 source language is OpenCL (see :ref:`amdgpu-opencl`).
5133 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
5136 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5137 termed vector memory operations.
5139 Private address space uses ``buffer_load/store`` using the scratch V#
5140 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5141 is accessing the memory, atomic memory orderings are not meaningful, and all
5142 accesses are treated as non-atomic.
5144 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5145 scalar memory instructions). Since the constant address space contents do not
5146 change during the execution of a kernel dispatch it is not legal to perform
5147 stores, and atomic memory orderings are not meaningful, and all accesses are
5148 treated as non-atomic.
5150 A memory synchronization scope wider than work-group is not meaningful for the
5151 group (LDS) address space and is treated as work-group.
5153 The memory model does not support the region address space which is treated as
5156 Acquire memory ordering is not meaningful on store atomic instructions and is
5157 treated as non-atomic.
5159 Release memory ordering is not meaningful on load atomic instructions and is
5160 treated as non-atomic.
5162 Acquire-release memory ordering is not meaningful on load or store atomic
5163 instructions and is treated as acquire and release respectively.
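The rules above for orderings and scopes that are not meaningful can be summarized as a small normalization step. This is a sketch of the stated rules only, using plain strings for instructions, orderings, scopes and address spaces; the function names and the ``"non-atomic"`` marker are hypothetical and do not correspond to backend data structures.

.. code-block:: python

   def effective_ordering(instr: str, ordering: str) -> str:
       """Apply the 'not meaningful' ordering rules to load/store atomics."""
       if instr == "load atomic":
           if ordering == "release":
               return "non-atomic"
           if ordering == "acq_rel":
               return "acquire"
       elif instr == "store atomic":
           if ordering == "acquire":
               return "non-atomic"
           if ordering == "acq_rel":
               return "release"
       # All other combinations keep the ordering they were given.
       return ordering


   def effective_scope(address_space: str, scope: str) -> str:
       """Treat scopes wider than work-group as work-group for LDS accesses."""
       if address_space == "local" and scope in ("agent", "system"):
           return "workgroup"
       return scope

An ``atomicrmw`` both reads and writes, so every ordering remains meaningful for it and the sketch returns it unchanged.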
5165 The memory order also adds the single thread optimization constraints defined in
5167 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5169 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5170 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5172 ============ ==============================================================
5173 LLVM Memory Optimization Constraints
5175 ============ ==============================================================
5178 acquire - If a load atomic/atomicrmw then no following load/load
5179 atomic/store/store atomic/atomicrmw/fence instruction can be
5180 moved before the acquire.
5181 - If a fence then same as load atomic, plus no preceding
5182 associated fence-paired-atomic can be moved after the fence.
5183 release - If a store atomic/atomicrmw then no preceding load/load
5184 atomic/store/store atomic/atomicrmw/fence instruction can be
5185 moved after the release.
5186 - If a fence then same as store atomic, plus no following
5187 associated fence-paired-atomic can be moved before the
5189 acq_rel Same constraints as both acquire and release.
5190 seq_cst - If a load atomic then same constraints as acquire, plus no
5191 preceding sequentially consistent load atomic/store
5192 atomic/atomicrmw/fence instruction can be moved after the
5194 - If a store atomic then the same constraints as release, plus
5195 no following sequentially consistent load atomic/store
5196 atomic/atomicrmw/fence instruction can be moved before the
5198 - If an atomicrmw/fence then same constraints as acq_rel.
5199 ============ ==============================================================
5201 The code sequences used to implement the memory model are defined in the
5204 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5205 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5206 * :ref:`amdgpu-amdhsa-memory-model-gfx940`
5207 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5209 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5211 Memory Model GFX6-GFX9
5212 ++++++++++++++++++++++
5216 * Each agent has multiple shader arrays (SA).
5217 * Each SA has multiple compute units (CU).
5218 * Each CU has multiple SIMDs that execute wavefronts.
5219 * The wavefronts for a single work-group are executed in the same CU but may be
5220 executed by different SIMDs.
5221 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
5223 * All LDS operations of a CU are performed as wavefront wide operations in a
5224 global order and involve no caching. Completion is reported to a wavefront in
5226 * The LDS memory has multiple request queues shared by the SIMDs of a
5227 CU. Therefore, the LDS operations performed by different wavefronts of a
5228 work-group can be reordered relative to each other, which can result in
5229 reordering the visibility of vector memory operations with respect to LDS
5230 operations of other wavefronts in the same work-group. A ``s_waitcnt
5231 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5232 vector memory operations between wavefronts of a work-group, but not between
5233 operations performed by the same wavefront.
5234 * The vector memory operations are performed as wavefront wide operations and
5235 completion is reported to a wavefront in execution order. The exception is
5236 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5237 vector memory order if they access LDS memory, and out of LDS operation order
5238 if they access global memory.
5239 * The vector memory operations access a single vector L1 cache shared by all
5240 SIMDs of a CU. Therefore, no special action is required for coherence between the
5241 lanes of a single wavefront, or for coherence between wavefronts in the same
5242 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
5243 wavefronts executing in different work-groups as they may be executing on
5245 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5246 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5247 scalar operations are used in a restricted way so do not impact the memory
5248 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5249 * The vector and scalar memory operations use an L2 cache shared by all CUs on
5251 * The L2 cache has independent channels to service disjoint ranges of virtual
5253 * Each CU has a separate request queue per channel. Therefore, the vector and
5254 scalar memory operations performed by wavefronts executing in different
5255 work-groups (which may be executing on different CUs) of an agent can be
5256 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5257 ensure synchronization between vector memory operations of different CUs. It
5258 ensures a previous vector memory operation has completed before executing a
5259 subsequent vector memory or LDS operation and so can be used to meet the
5260 requirements of acquire and release.
5261 * The L2 cache can be kept coherent with other agents on some targets, or ranges
5262 of virtual addresses can be set up to bypass it to ensure system coherence.
5264 Scalar memory operations are only used to access memory that is proven to not
5265 change during the execution of the kernel dispatch. This includes constant
5266 address space and global address space for program scope ``const`` variables.
5267 Therefore, the kernel machine code does not have to maintain the scalar cache to
5268 ensure it is coherent with the vector caches. The scalar and vector caches are
5269 invalidated between kernel dispatches by CP since constant address space data
5270 may change between kernel dispatch executions. See
5271 :ref:`amdgpu-amdhsa-memory-spaces`.
5273 The one exception is if scalar writes are used to spill SGPR registers. In this
5274 case the AMDGPU backend ensures the memory location used to spill is never
5275 accessed by vector memory operations at the same time. If scalar writes are used
5276 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5277 return since the locations may be used for vector memory instructions by a
5278 future wavefront that uses the same scratch area, or a function call that
5279 creates a frame at the same address, respectively. There is no need for a
5280 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5282 For kernarg backing memory:
5284 * CP invalidates the L1 cache at the start of each kernel dispatch.
5285 * On dGPU the kernarg backing memory is allocated in host memory accessed as
5286 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5287 causes it to be treated as non-volatile and so is not invalidated by
5289 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
5290 and so the L2 cache will be coherent with the CPU and other agents.
5292 Scratch backing memory (which is used for the private address space) is accessed
5293 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5294 only accessed by a single thread, and is always write-before-read, there is
5295 never a need to invalidate these entries from the L1 cache. Hence all cache
5296 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5298 The code sequences used to implement the memory model for GFX6-GFX9 are defined
5299 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5301 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5302 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5304 ============ ============ ============== ========== ================================
5305 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
5306 Ordering Sync Scope Address GFX6-GFX9
5308 ============ ============ ============== ========== ================================
5310 ------------------------------------------------------------------------------------
5311 load *none* *none* - global - !volatile & !nontemporal
5313 - private 1. buffer/global/flat_load
5315 - !volatile & nontemporal
5317 1. buffer/global/flat_load
5322 1. buffer/global/flat_load
5324 2. s_waitcnt vmcnt(0)
5326 - Must happen before
5327 any following volatile
5338 load *none* *none* - local 1. ds_load
5339 store *none* *none* - global - !volatile & !nontemporal
5341 - private 1. buffer/global/flat_store
5343 - !volatile & nontemporal
5345 1. buffer/global/flat_store
5350 1. buffer/global/flat_store
5351 2. s_waitcnt vmcnt(0)
5353 - Must happen before
5354 any following volatile
5365 store *none* *none* - local 1. ds_store
5366 **Unordered Atomic**
5367 ------------------------------------------------------------------------------------
5368 load atomic unordered *any* *any* *Same as non-atomic*.
5369 store atomic unordered *any* *any* *Same as non-atomic*.
5370 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
5371 **Monotonic Atomic**
5372 ------------------------------------------------------------------------------------
5373 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
5375 - workgroup - generic
5376 load atomic monotonic - agent - global 1. buffer/global/flat_load
5377 - system - generic glc=1
5378 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
5379 - wavefront - generic
5383 store atomic monotonic - singlethread - local 1. ds_store
5386 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
5387 - wavefront - generic
5391 atomicrmw monotonic - singlethread - local 1. ds_atomic
5395 ------------------------------------------------------------------------------------
5396 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
5399 load atomic acquire - workgroup - global 1. buffer/global_load
5400 load atomic acquire - workgroup - local 1. ds/flat_load
5401 - generic 2. s_waitcnt lgkmcnt(0)
5404 - Must happen before
5413 older than a local load
5417 load atomic acquire - agent - global 1. buffer/global_load
5419 2. s_waitcnt vmcnt(0)
5421 - Must happen before
5429 3. buffer_wbinvl1_vol
5431 - Must happen before
5441 load atomic acquire - agent - generic 1. flat_load glc=1
5442 - system 2. s_waitcnt vmcnt(0) &
5447 - Must happen before
5450 - Ensures the flat_load
5455 3. buffer_wbinvl1_vol
5457 - Must happen before
5467 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
5470 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
5471 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
5472 - generic 2. s_waitcnt lgkmcnt(0)
5475 - Must happen before
5488 atomicrmw acquire - agent - global 1. buffer/global_atomic
5489 - system 2. s_waitcnt vmcnt(0)
5491 - Must happen before
5500 3. buffer_wbinvl1_vol
5502 - Must happen before
5512 atomicrmw acquire - agent - generic 1. flat_atomic
5513 - system 2. s_waitcnt vmcnt(0) &
5518 - Must happen before
5527 3. buffer_wbinvl1_vol
5529 - Must happen before
5539 fence acquire - singlethread *none* *none*
5541 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5546 - However, since LLVM
5571 fence-paired-atomic).
5572 - Must happen before
5583 fence-paired-atomic.
5585 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
5592 - However, since LLVM
5600 - Could be split into
5609 - s_waitcnt vmcnt(0)
5620 fence-paired-atomic).
5621 - s_waitcnt lgkmcnt(0)
5632 fence-paired-atomic).
5633 - Must happen before
5647 fence-paired-atomic.
5649 2. buffer_wbinvl1_vol
5651 - Must happen before any
5652 following global/generic
5662 ------------------------------------------------------------------------------------
5663 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
5666 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5675 - Must happen before
5686 2. buffer/global/flat_store
5687 store atomic release - workgroup - local 1. ds_store
5688 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
5689 - system - generic vmcnt(0)
5695 - Could be split into
5704 - s_waitcnt vmcnt(0)
5711 - s_waitcnt lgkmcnt(0)
5718 - Must happen before
5729 2. buffer/global/flat_store
5730 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
5733 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5742 - Must happen before
5753 2. buffer/global/flat_atomic
5754 atomicrmw release - workgroup - local 1. ds_atomic
5755 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
5756 - system - generic vmcnt(0)
5760 - Could be split into
5769 - s_waitcnt vmcnt(0)
5776 - s_waitcnt lgkmcnt(0)
5783 - Must happen before
5794 2. buffer/global/flat_atomic
5795 fence release - singlethread *none* *none*
5797 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5802 - However, since LLVM
5823 - Must happen before
5832 fence-paired-atomic).
5839 fence-paired-atomic.
5841 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
5852 - However, since LLVM
5867 - Could be split into
5876 - s_waitcnt vmcnt(0)
5883 - s_waitcnt lgkmcnt(0)
5890 - Must happen before
5899 fence-paired-atomic).
5906 fence-paired-atomic.
5908 **Acquire-Release Atomic**
5909 ------------------------------------------------------------------------------------
5910 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
5913 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
5922 - Must happen before
5933 2. buffer/global_atomic
5935 atomicrmw acq_rel - workgroup - local 1. ds_atomic
5936 2. s_waitcnt lgkmcnt(0)
5939 - Must happen before
5948 older than the local load
5952 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
5961 - Must happen before
5973 3. s_waitcnt lgkmcnt(0)
5976 - Must happen before
5985 older than a local load
5989 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
5994 - Could be split into
6003 - s_waitcnt vmcnt(0)
6010 - s_waitcnt lgkmcnt(0)
6017 - Must happen before
6028 2. buffer/global_atomic
6029 3. s_waitcnt vmcnt(0)
6031 - Must happen before
6040 4. buffer_wbinvl1_vol
6042 - Must happen before
6052 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
6057 - Could be split into
6066 - s_waitcnt vmcnt(0)
6073 - s_waitcnt lgkmcnt(0)
6080 - Must happen before
6092 3. s_waitcnt vmcnt(0) &
6097 - Must happen before
6106 4. buffer_wbinvl1_vol
6108 - Must happen before
6118 fence acq_rel - singlethread *none* *none*
6120 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6140 - Must happen before
6163 acquire-fence-paired-atomic)
6184 release-fence-paired-atomic).
6189 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
6196 - However, since LLVM
6204 - Could be split into
6213 - s_waitcnt vmcnt(0)
6220 - s_waitcnt lgkmcnt(0)
6227 - Must happen before
6232 global/local/generic
6241 acquire-fence-paired-atomic)
6253 global/local/generic
6262 release-fence-paired-atomic).
6267 2. buffer_wbinvl1_vol
6269 - Must happen before
6283 **Sequential Consistent Atomic**
6284 ------------------------------------------------------------------------------------
6285 load atomic seq_cst - singlethread - global *Same as corresponding
6286 - wavefront - local load atomic acquire,
6287 - generic except must generate
6288 all instructions even
6290 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
6306 lgkmcnt(0) and so do
6338 order. The s_waitcnt
6339 could be placed after
6343 make the s_waitcnt be
6350 instructions same as
6353 except must generate
6354 all instructions even
6356 load atomic seq_cst - workgroup - local *Same as corresponding
6357 load atomic acquire,
6358 except must generate
6359 all instructions even
6362 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
6363 - system - generic vmcnt(0)
6365 - Could be split into
6374 - s_waitcnt lgkmcnt(0)
6387 lgkmcnt(0) and so do
6390 - s_waitcnt vmcnt(0)
6435 order. The s_waitcnt
6436 could be placed after
6440 make the s_waitcnt be
6447 instructions same as
6450 except must generate
6451 all instructions even
6453 store atomic seq_cst - singlethread - global *Same as corresponding
6454 - wavefront - local store atomic release,
6455 - workgroup - generic except must generate
6456 - agent all instructions even
6457 - system for OpenCL.*
6458 atomicrmw seq_cst - singlethread - global *Same as corresponding
6459 - wavefront - local atomicrmw acq_rel,
6460 - workgroup - generic except must generate
6461 - agent all instructions even
6462 - system for OpenCL.*
6463 fence seq_cst - singlethread *none* *Same as corresponding
6464 - wavefront fence acq_rel,
6465 - workgroup except must generate
6466 - agent all instructions even
6467 - system for OpenCL.*
6468 ============ ============ ============== ========== ================================
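
As a reading aid, a few acquire and release rows of the table above can be
transcribed into a lookup. The following Python sketch is illustrative only:
the names ``SEQUENCES`` and ``code_sequence`` are hypothetical and are not part
of LLVM, the instruction strings are copied from the table as written, and the
ordering constraints listed under each step are not modeled.

.. code-block:: python

   # Hypothetical transcription of a few rows of the table above.
   # Key: (LLVM instr, memory ordering, sync scope, address space).
   SEQUENCES = {
       ("atomicrmw", "acquire", "workgroup", "local"): [
           "ds/flat_atomic",
           "s_waitcnt lgkmcnt(0)",
       ],
       # The same row also covers the system scope.
       ("atomicrmw", "acquire", "agent", "global"): [
           "buffer/global_atomic",
           "s_waitcnt vmcnt(0)",
           "buffer_wbinvl1_vol",
       ],
       ("store atomic", "release", "workgroup", "global"): [
           "s_waitcnt lgkmcnt(0)",
           "buffer/global/flat_store",
       ],
       # The same row also covers the system scope.
       ("store atomic", "release", "agent", "global"): [
           "s_waitcnt lgkmcnt(0) & vmcnt(0)",
           "buffer/global/flat_store",
       ],
   }


   def code_sequence(instr, ordering, scope, address_space):
       """Return the ordered machine code steps for one transcribed row."""
       key = (instr, ordering, scope, address_space)
       if key not in SEQUENCES:
           raise KeyError(f"row not transcribed in this sketch: {key}")
       return list(SEQUENCES[key])

The lookup shows the general pattern of the table: an acquire places its wait
and cache invalidate after the memory instruction, while a release places its
wait before it.
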
6470 .. _amdgpu-amdhsa-memory-model-gfx90a:
6477 * Each agent has multiple shader arrays (SA).
6478 * Each SA has multiple compute units (CU).
6479 * Each CU has multiple SIMDs that execute wavefronts.
6480 * The wavefronts for a single work-group are executed in the same CU but may be
6481 executed by different SIMDs. The exception is tgsplit execution mode, in which
6482 the wavefronts may be executed by different SIMDs in different CUs.
6483 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6484 executing on it. The exception is tgsplit execution mode, in which no LDS is
6485 allocated because wavefronts of the same work-group can be in different CUs.
6486 * All LDS operations of a CU are performed as wavefront wide operations in a
6487 global order and involve no caching. Completion is reported to a wavefront in
6489 * The LDS memory has multiple request queues shared by the SIMDs of a
6490 CU. Therefore, the LDS operations performed by different wavefronts of a
6491 work-group can be reordered relative to each other, which can result in
6492 reordering the visibility of vector memory operations with respect to LDS
6493 operations of other wavefronts in the same work-group. A ``s_waitcnt
6494 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6495 vector memory operations between wavefronts of a work-group, but not between
6496 operations performed by the same wavefront.
6497 * The vector memory operations are performed as wavefront wide operations and
6498 completion is reported to a wavefront in execution order. The exception is
6499 that ``flat_load/store/atomic`` instructions can report out of vector memory
6500 order if they access LDS memory, and out of LDS operation order if they access
6502 * The vector memory operations access a single vector L1 cache shared by all
6503 SIMDs of a CU. Therefore:
6505 * No special action is required for coherence between the lanes of a single
6508 * No special action is required for coherence between wavefronts in the same
6509 work-group since they execute on the same CU. The exception is when in
6510 tgsplit execution mode as wavefronts of the same work-group can be in
6511 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6514 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6515 executing in different work-groups as they may be executing on different
6518 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6519 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6520 scalar operations are used in a restricted way so do not impact the memory
6521 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6522 * The vector and scalar memory operations use an L2 cache shared by all CUs on
6525 * The L2 cache has independent channels to service disjoint ranges of virtual
6527 * Each CU has a separate request queue per channel. Therefore, the vector and
6528 scalar memory operations performed by wavefronts executing in different
6529 work-groups (which may be executing on different CUs), or the same
6530 work-group if executing in tgsplit mode, of an agent can be reordered
6531 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6532 synchronization between vector memory operations of different CUs. It
6533 ensures a previous vector memory operation has completed before executing a
6534 subsequent vector memory or LDS operation and so can be used to meet the
6535 requirements of acquire and release.
6536 * The L2 cache of one agent can be kept coherent with other agents by:
6537 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6538 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6539 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6541 * Any local memory cache lines will be automatically invalidated by writes
6542 from CUs associated with other L2 caches, or writes from the CPU, due to
6543 the cache probe caused by coherent requests. Coherent requests are caused
6544 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6545 XGMI, and by PCIe requests that are configured to be coherent requests.
6546 * XGMI accesses from the CPU to local memory may be cached on the CPU.
6547 Subsequent access from the GPU will automatically invalidate or writeback
6548 the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6549 * Since all work-groups on the same agent share the same L2, no L2
6550 invalidation or writeback is required for coherence.
6551 * To ensure coherence of local and remote memory writes of work-groups in
6552 different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6553 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6554 (used for remote coarse grain memory). Note that MTYPE CC (used for local
6555 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6556 remote fine grain memory) bypasses the L2, so both will never result in
6557 dirty L2 cache lines.
6558 * To ensure coherence of local and remote memory reads of work-groups in
6559 different agents a ``buffer_invl2`` is required. It will invalidate L2
6560 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6561 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6562 coarse grain memory) cause local reads to be invalidated by remote writes
6563 with the PTE C-bit set so these cache lines are not invalidated. Note that
6564 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6565 never result in L2 cache lines that need to be invalidated.
6567 * PCIe access from the GPU to the CPU memory is kept coherent by using the
6568 MTYPE UC (uncached) which bypasses the L2.
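
The MTYPE rules in the list above can be summarized as a small lookup. The
following Python sketch is only a restatement of that list: the names
``MTYPE_USE`` and ``l2_maintenance`` are hypothetical, and the sketch says
nothing about when the instructions must be issued.

.. code-block:: python

   # MTYPE -> the kind of memory it is used for, per the list above.
   MTYPE_USE = {
       "RW": "local coarse grain memory",
       "NC": "remote coarse grain memory",
       "CC": "local fine grain memory",
       "UC": "remote fine grain memory",
   }

   # buffer_wbl2 writes back dirty L2 lines of these MTYPEs.
   WRITTEN_BACK_BY_WBL2 = {"RW", "NC"}
   # buffer_invl2 invalidates L2 lines of these MTYPEs.
   INVALIDATED_BY_INVL2 = {"NC"}


   def l2_maintenance(mtype):
       """Return the L2 instructions that act on cache lines of ``mtype``."""
       if mtype not in MTYPE_USE:
           raise ValueError(f"unknown MTYPE: {mtype}")
       ops = []
       if mtype in WRITTEN_BACK_BY_WBL2:
           ops.append("buffer_wbl2")
       if mtype in INVALIDATED_BY_INVL2:
           ops.append("buffer_invl2")
       return ops

For example, ``l2_maintenance("NC")`` lists both instructions, while MTYPE CC
and MTYPE UC yield an empty list because they never leave dirty or stale lines
in the L2.
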
6570 Scalar memory operations are only used to access memory that is proven to not
6571 change during the execution of the kernel dispatch. This includes constant
6572 address space and global address space for program scope ``const`` variables.
6573 Therefore, the kernel machine code does not have to maintain the scalar cache to
6574 ensure it is coherent with the vector caches. The scalar and vector caches are
6575 invalidated between kernel dispatches by CP since constant address space data
6576 may change between kernel dispatch executions. See
6577 :ref:`amdgpu-amdhsa-memory-spaces`.
6579 The one exception is if scalar writes are used to spill SGPR registers. In this
6580 case the AMDGPU backend ensures the memory location used to spill is never
6581 accessed by vector memory operations at the same time. If scalar writes are used
6582 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6583 return since the locations may be used for vector memory instructions by a
6584 future wavefront that uses the same scratch area, or a function call that
6585 creates a frame at the same address, respectively. There is no need for a
6586 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
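
The placement rule for ``s_dcache_wb`` can be sketched over a flat list of
instruction names. This Python sketch is an assumption-laden illustration and
not the backend's implementation: the function name, the ``uses_scalar_writes``
flag, and the ``"<return>"`` placeholder standing in for a function return are
all hypothetical.

.. code-block:: python

   def insert_scalar_writeback(instructions, uses_scalar_writes,
                               return_marker="<return>"):
       """Insert s_dcache_wb before s_endpgm and before each return marker.

       ``return_marker`` is a hypothetical placeholder for a function return.
       """
       if not uses_scalar_writes:
           # No scalar writes were used for spilling, so nothing is inserted.
           return list(instructions)
       out = []
       for inst in instructions:
           if inst in ("s_endpgm", return_marker):
               out.append("s_dcache_wb")
           out.append(inst)
       return out

No ``s_dcache_inv`` is inserted in the sketch, matching the reasoning above
that scalar writes are always write-before-read in the same thread.
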
6588 For kernarg backing memory:
6590 * CP invalidates the L1 cache at the start of each kernel dispatch.
6591 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6592 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6593 cache. This also causes it to be treated as non-volatile and so is not
6594 invalidated by ``*_vol``.
6595 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6596 so the L2 cache will be coherent with the CPU and other agents.
6598 Scratch backing memory (which is used for the private address space) is accessed
6599 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6600 only accessed by a single thread, and is always write-before-read, there is
6601 never a need to invalidate these entries from the L1 cache. Hence all cache
6602 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6604 The code sequences used to implement the memory model for GFX90A are defined
6605 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6607 .. table:: AMDHSA Memory Model Code Sequences GFX90A
6608 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6610 ============ ============ ============== ========== ================================
6611 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
6612 Ordering Sync Scope Address GFX90A
6614 ============ ============ ============== ========== ================================
6616 ------------------------------------------------------------------------------------
6617 load *none* *none* - global - !volatile & !nontemporal
6619 - private 1. buffer/global/flat_load
6621 - !volatile & nontemporal
6623 1. buffer/global/flat_load
6628 1. buffer/global/flat_load
6630 2. s_waitcnt vmcnt(0)
6632 - Must happen before
6633 any following volatile
6644 load *none* *none* - local 1. ds_load
6645 store *none* *none* - global - !volatile & !nontemporal
6647 - private 1. buffer/global/flat_store
6649 - !volatile & nontemporal
6651 1. buffer/global/flat_store
6656 1. buffer/global/flat_store
6657 2. s_waitcnt vmcnt(0)
6659 - Must happen before
6660 any following volatile
6671 store *none* *none* - local 1. ds_store
6672 **Unordered Atomic**
6673 ------------------------------------------------------------------------------------
6674 load atomic unordered *any* *any* *Same as non-atomic*.
6675 store atomic unordered *any* *any* *Same as non-atomic*.
6676 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
6677 **Monotonic Atomic**
6678 ------------------------------------------------------------------------------------
6679 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
6680 - wavefront - generic
6681 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
6684 - If not TgSplit execution
6687 load atomic monotonic - singlethread - local *If TgSplit execution mode,
6688 - wavefront local address space cannot
6689 - workgroup be used.*
6692 load atomic monotonic - agent - global 1. buffer/global/flat_load
6694 load atomic monotonic - system - global 1. buffer/global/flat_load
6696 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
6697 - wavefront - generic
6700 store atomic monotonic - system - global 1. buffer/global/flat_store
6702 store atomic monotonic - singlethread - local *If TgSplit execution mode,
6703 - wavefront local address space cannot
6704 - workgroup be used.*
6707 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
6708 - wavefront - generic
6711 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
6713 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
6714 - wavefront local address space cannot
6715 - workgroup be used.*
6719 ------------------------------------------------------------------------------------
6720 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
6723 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
6725 - If not TgSplit execution
6728 2. s_waitcnt vmcnt(0)
6730 - If not TgSplit execution
6732 - Must happen before the
6733 following buffer_wbinvl1_vol.
6735 3. buffer_wbinvl1_vol
6737 - If not TgSplit execution
6739 - Must happen before
6750 load atomic acquire - workgroup - local *If TgSplit execution mode,
6751 local address space cannot
6755 2. s_waitcnt lgkmcnt(0)
6758 - Must happen before
6767 older than the local load
6771 load atomic acquire - workgroup - generic 1. flat_load glc=1
6773 - If not TgSplit execution
6776 2. s_waitcnt lgkm/vmcnt(0)
6778 - Use lgkmcnt(0) if not
6779 TgSplit execution mode
6780 and vmcnt(0) if TgSplit
6782 - If OpenCL, omit lgkmcnt(0).
6783 - Must happen before
6785 buffer_wbinvl1_vol and any
6786 following global/generic
6793 older than a local load
6797 3. buffer_wbinvl1_vol
6799 - If not TgSplit execution
6806 load atomic acquire - agent - global 1. buffer/global_load
6808 2. s_waitcnt vmcnt(0)
6810 - Must happen before
6818 3. buffer_wbinvl1_vol
6820 - Must happen before
6830 load atomic acquire - system - global 1. buffer/global/flat_load
6832 2. s_waitcnt vmcnt(0)
6834 - Must happen before
6835 following buffer_invl2 and
6845 - Must happen before
6853 stale L1 global data,
6854 nor see stale L2 MTYPE
6856 MTYPE RW and CC memory will
6857 never be stale in L2 due to
6860 load atomic acquire - agent - generic 1. flat_load glc=1
6861 2. s_waitcnt vmcnt(0) &
6864 - If TgSplit execution mode,
6868 - Must happen before
6871 - Ensures the flat_load
6876 3. buffer_wbinvl1_vol
6878 - Must happen before
6888 load atomic acquire - system - generic 1. flat_load glc=1
6889 2. s_waitcnt vmcnt(0) &
6892 - If TgSplit execution mode,
6896 - Must happen before
6900 - Ensures the flat_load
6908 - Must happen before
6916 stale L1 global data,
6917 nor see stale L2 MTYPE
6919 MTYPE RW and CC memory will
6920 never be stale in L2 due to
6923 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
6924 - wavefront - generic
6925 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
6926 - wavefront local address space cannot
6930 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
6931 2. s_waitcnt vmcnt(0)
6933 - If not TgSplit execution
6935 - Must happen before the
6936 following buffer_wbinvl1_vol.
6937 - Ensures the atomicrmw
6942 3. buffer_wbinvl1_vol
6944 - If not TgSplit execution
6946 - Must happen before
6956 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
6957 local address space cannot
6961 2. s_waitcnt lgkmcnt(0)
6964 - Must happen before
6973 older than the local
6977 atomicrmw acquire - workgroup - generic 1. flat_atomic
6978 2. s_waitcnt lgkm/vmcnt(0)
6980 - Use lgkmcnt(0) if not
6981 TgSplit execution mode
6982 and vmcnt(0) if TgSplit
6984 - If OpenCL, omit lgkmcnt(0).
6985 - Must happen before
6987 buffer_wbinvl1_vol and
7000 3. buffer_wbinvl1_vol
7002 - If not TgSplit execution
7009 atomicrmw acquire - agent - global 1. buffer/global_atomic
7010 2. s_waitcnt vmcnt(0)
7012 - Must happen before
7021 3. buffer_wbinvl1_vol
7023 - Must happen before
7033 atomicrmw acquire - system - global 1. buffer/global_atomic
7034 2. s_waitcnt vmcnt(0)
7036 - Must happen before
7037 following buffer_invl2 and
7048 - Must happen before
7056 stale L1 global data,
7057 nor see stale L2 MTYPE
7059 MTYPE RW and CC memory will
7060 never be stale in L2 due to
7063 atomicrmw acquire - agent - generic 1. flat_atomic
7064 2. s_waitcnt vmcnt(0) &
7067 - If TgSplit execution mode,
7071 - Must happen before
7080 3. buffer_wbinvl1_vol
7082 - Must happen before
7092 atomicrmw acquire - system - generic 1. flat_atomic
7093 2. s_waitcnt vmcnt(0) &
7096 - If TgSplit execution mode,
7100 - Must happen before
7113 - Must happen before
7121 stale L1 global data,
7122 nor see stale L2 MTYPE
7124 MTYPE RW and CC memory will
7125 never be stale in L2 due to
7128 fence acquire - singlethread *none* *none*
7130 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7132 - Use lgkmcnt(0) if not
7133 TgSplit execution mode
7134 and vmcnt(0) if TgSplit
7144 - However, since LLVM
7159 - s_waitcnt vmcnt(0)
7171 fence-paired-atomic).
7172 - s_waitcnt lgkmcnt(0)
7183 fence-paired-atomic).
7184 - Must happen before
7186 buffer_wbinvl1_vol and
7197 fence-paired-atomic.
7199 2. buffer_wbinvl1_vol
7201 - If not TgSplit execution
7208 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
7211 - If TgSplit execution mode,
7217 - However, since LLVM
7225 - Could be split into
7234 - s_waitcnt vmcnt(0)
7245 fence-paired-atomic).
7246 - s_waitcnt lgkmcnt(0)
7257 fence-paired-atomic).
7258 - Must happen before
7272 fence-paired-atomic.
7274 2. buffer_wbinvl1_vol
7276 - Must happen before any
7277 following global/generic
7286 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
7289 - If TgSplit execution mode,
7295 - However, since LLVM
7303 - Could be split into
7312 - s_waitcnt vmcnt(0)
7323 fence-paired-atomic).
7324 - s_waitcnt lgkmcnt(0)
7335 fence-paired-atomic).
7336 - Must happen before
7337 the following buffer_invl2 and
7350 fence-paired-atomic.
7355 - Must happen before any
7356 following global/generic
7363 stale L1 global data,
7364 nor see stale L2 MTYPE
7366 MTYPE RW and CC memory will
7367 never be stale in L2 due to
7370 ------------------------------------------------------------------------------------
7371 store atomic release - singlethread - global 1. buffer/global/flat_store
7372 - wavefront - generic
7373 store atomic release - singlethread - local *If TgSplit execution mode,
7374 - wavefront local address space cannot
7378 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7380 - Use lgkmcnt(0) if not
7381 TgSplit execution mode
7382 and vmcnt(0) if TgSplit
7384 - If OpenCL, omit lgkmcnt(0).
7385 - s_waitcnt vmcnt(0)
7388 global/generic load/store/
7389 load atomic/store atomic/
7391 - s_waitcnt lgkmcnt(0)
7398 - Must happen before
7409 2. buffer/global/flat_store
7410 store atomic release - workgroup - local *If TgSplit execution mode,
7411 local address space cannot
7415 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
7418 - If TgSplit execution mode,
7424 - Could be split into
7433 - s_waitcnt vmcnt(0)
7440 - s_waitcnt lgkmcnt(0)
7447 - Must happen before
7458 2. buffer/global/flat_store
7459 store atomic release - system - global 1. buffer_wbl2
7461 - Must happen before
7462 following s_waitcnt.
7463 - Performs L2 writeback to
7467 visible at system scope.
7469 2. s_waitcnt lgkmcnt(0) &
7472 - If TgSplit execution mode,
7478 - Could be split into
7487 - s_waitcnt vmcnt(0)
7488 must happen after any
7494 - s_waitcnt lgkmcnt(0)
7495 must happen after any
7501 - Must happen before
7506 to memory and the L2
7513 3. buffer/global/flat_store
7514 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
7515 - wavefront - generic
7516 atomicrmw release - singlethread - local *If TgSplit execution mode,
7517 - wavefront local address space cannot
7521 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7523 - Use lgkmcnt(0) if not
7524 TgSplit execution mode
7525 and vmcnt(0) if TgSplit
7529 - s_waitcnt vmcnt(0)
7532 global/generic load/store/
7533 load atomic/store atomic/
7535 - s_waitcnt lgkmcnt(0)
7542 - Must happen before
7553 2. buffer/global/flat_atomic
7554 atomicrmw release - workgroup - local *If TgSplit execution mode,
7555 local address space cannot
7559 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
7562 - If TgSplit execution mode,
7566 - Could be split into
7575 - s_waitcnt vmcnt(0)
7582 - s_waitcnt lgkmcnt(0)
7589 - Must happen before
7600 2. buffer/global/flat_atomic
7601 atomicrmw release - system - global 1. buffer_wbl2
7603 - Must happen before
7604 following s_waitcnt.
7605 - Performs L2 writeback to
7609 visible at system scope.
7611 2. s_waitcnt lgkmcnt(0) &
7614 - If TgSplit execution mode,
7618 - Could be split into
7627 - s_waitcnt vmcnt(0)
7634 - s_waitcnt lgkmcnt(0)
7641 - Must happen before
7646 to memory and the L2
7653 3. buffer/global/flat_atomic
7654 fence release - singlethread *none* *none*
7656 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7658 - Use lgkmcnt(0) if not
7659 TgSplit execution mode
7660 and vmcnt(0) if TgSplit
7670 - However, since LLVM
7685 - s_waitcnt vmcnt(0)
7690 load atomic/store atomic/
7692 - s_waitcnt lgkmcnt(0)
7699 - Must happen before
7708 fence-paired-atomic).
7715 fence-paired-atomic.
7717 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
7720 - If TgSplit execution mode,
7730 - However, since LLVM
7745 - Could be split into
7754 - s_waitcnt vmcnt(0)
7761 - s_waitcnt lgkmcnt(0)
7768 - Must happen before
7777 fence-paired-atomic).
7784 fence-paired-atomic.
7786 fence release - system *none* 1. buffer_wbl2
7791 - Must happen before
7792 following s_waitcnt.
7793 - Performs L2 writeback to
7797 visible at system scope.
7799 2. s_waitcnt lgkmcnt(0) &
7802 - If TgSplit execution mode,
7812 - However, since LLVM
7827 - Could be split into
7836 - s_waitcnt vmcnt(0)
7843 - s_waitcnt lgkmcnt(0)
7850 - Must happen before
7859 fence-paired-atomic).
7866 fence-paired-atomic.
7868 **Acquire-Release Atomic**
7869 ------------------------------------------------------------------------------------
7870 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
7871 - wavefront - generic
7872 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
7873 - wavefront local address space cannot
7877 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7879 - Use lgkmcnt(0) if not
7880 TgSplit execution mode
7881 and vmcnt(0) if TgSplit
7891 - s_waitcnt vmcnt(0)
7894 global/generic load/store/
7895 load atomic/store atomic/
7897 - s_waitcnt lgkmcnt(0)
7904 - Must happen before
7915 2. buffer/global_atomic
7916 3. s_waitcnt vmcnt(0)
7918 - If not TgSplit execution
7920 - Must happen before
7930 4. buffer_wbinvl1_vol
7932 - If not TgSplit execution
7939 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
7940 local address space cannot
7944 2. s_waitcnt lgkmcnt(0)
7947 - Must happen before
7956 older than the local load
7960 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
7962 - Use lgkmcnt(0) if not
7963 TgSplit execution mode
7964 and vmcnt(0) if TgSplit
7968 - s_waitcnt vmcnt(0)
7971 global/generic load/store/
7972 load atomic/store atomic/
7974 - s_waitcnt lgkmcnt(0)
7981 - Must happen before
7993 3. s_waitcnt lgkmcnt(0) &
7996 - If not TgSplit execution
7997 mode, omit vmcnt(0).
8000 - Must happen before
8002 buffer_wbinvl1_vol and
8011 older than a local load
8015 4. buffer_wbinvl1_vol
8017 - If not TgSplit execution
8024 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
8027 - If TgSplit execution mode,
8031 - Could be split into
8040 - s_waitcnt vmcnt(0)
8047 - s_waitcnt lgkmcnt(0)
8054 - Must happen before
8065 2. buffer/global_atomic
8066 3. s_waitcnt vmcnt(0)
8068 - Must happen before
8077 4. buffer_wbinvl1_vol
8079 - Must happen before
8089 atomicrmw acq_rel - system - global 1. buffer_wbl2
8091 - Must happen before
8092 following s_waitcnt.
8093 - Performs L2 writeback to
8097 visible at system scope.
8099 2. s_waitcnt lgkmcnt(0) &
8102 - If TgSplit execution mode,
8106 - Could be split into
8115 - s_waitcnt vmcnt(0)
8122 - s_waitcnt lgkmcnt(0)
8129 - Must happen before
8134 to global and L2 writeback
8135 have completed before
8140 3. buffer/global_atomic
8141 4. s_waitcnt vmcnt(0)
8143 - Must happen before
8144 following buffer_invl2 and
8155 - Must happen before
8163 stale L1 global data,
8164 nor see stale L2 MTYPE
8166 MTYPE RW and CC memory will
8167 never be stale in L2 due to
8170 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
8173 - If TgSplit execution mode,
8177 - Could be split into
8186 - s_waitcnt vmcnt(0)
8193 - s_waitcnt lgkmcnt(0)
8200 - Must happen before
8212 3. s_waitcnt vmcnt(0) &
8215 - If TgSplit execution mode,
8219 - Must happen before
8228 4. buffer_wbinvl1_vol
8230 - Must happen before
8240 atomicrmw acq_rel - system - generic 1. buffer_wbl2
8242 - Must happen before
8243 following s_waitcnt.
8244 - Performs L2 writeback to
8248 visible at system scope.
8250 2. s_waitcnt lgkmcnt(0) &
8253 - If TgSplit execution mode,
8257 - Could be split into
8266 - s_waitcnt vmcnt(0)
8273 - s_waitcnt lgkmcnt(0)
8280 - Must happen before
8285 to global and L2 writeback
8286 have completed before
8292 4. s_waitcnt vmcnt(0) &
8295 - If TgSplit execution mode,
8299 - Must happen before
8300 following buffer_invl2 and
8311 - Must happen before
8319 stale L1 global data,
8320 nor see stale L2 MTYPE
8322 MTYPE RW and CC memory will
8323 never be stale in L2 due to
8326 fence acq_rel - singlethread *none* *none*
8328 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8330 - Use lgkmcnt(0) if not
8331 TgSplit execution mode
8332 and vmcnt(0) if TgSplit
8351 - s_waitcnt vmcnt(0)
8356 load atomic/store atomic/
8358 - s_waitcnt lgkmcnt(0)
8365 - Must happen before
8388 acquire-fence-paired-atomic)
8409 release-fence-paired-atomic).
8413 - Must happen before
8417 acquire-fence-paired
8418 atomic has completed
8427 acquire-fence-paired-atomic.
8429 2. buffer_wbinvl1_vol
8431 - If not TgSplit execution
8438 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
8441 - If TgSplit execution mode,
8447 - However, since LLVM
8455 - Could be split into
8464 - s_waitcnt vmcnt(0)
8471 - s_waitcnt lgkmcnt(0)
8478 - Must happen before
8483 global/local/generic
8492 acquire-fence-paired-atomic)
8504 global/local/generic
8513 release-fence-paired-atomic).
8518 2. buffer_wbinvl1_vol
8520 - Must happen before
8534 fence acq_rel - system *none* 1. buffer_wbl2
8539 - Must happen before
8540 following s_waitcnt.
8541 - Performs L2 writeback to
8545 visible at system scope.
8547 2. s_waitcnt lgkmcnt(0) &
8550 - If TgSplit execution mode,
8556 - However, since LLVM
8564 - Could be split into
8573 - s_waitcnt vmcnt(0)
8580 - s_waitcnt lgkmcnt(0)
8587 - Must happen before
8588 the following buffer_invl2 and
8592 global/local/generic
8601 acquire-fence-paired-atomic)
8613 global/local/generic
8622 release-fence-paired-atomic).
8630 - Must happen before
8639 stale L1 global data,
8640 nor see stale L2 MTYPE
8642 MTYPE RW and CC memory will
8643 never be stale in L2 due to
8646 **Sequential Consistent Atomic**
8647 ------------------------------------------------------------------------------------
8648 load atomic seq_cst - singlethread - global *Same as corresponding
8649 - wavefront - local load atomic acquire,
8650 - generic except must generate
8651 all instructions even
8653 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8655 - Use lgkmcnt(0) if not
8656 TgSplit execution mode
8657 and vmcnt(0) if TgSplit
8659 - s_waitcnt lgkmcnt(0) must
8672 lgkmcnt(0) and so do
8675 - s_waitcnt vmcnt(0)
8694 consistent global/local
8720 order. The s_waitcnt
8721 could be placed after
8725 make the s_waitcnt be
8732 instructions same as
8735 except must generate
8736 all instructions even
8738 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
8739 local address space cannot
8742 *Same as corresponding
8743 load atomic acquire,
8744 except must generate
8745 all instructions even
8748 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
8749 - system - generic vmcnt(0)
8751 - If TgSplit execution mode,
8753 - Could be split into
8762 - s_waitcnt lgkmcnt(0)
8775 lgkmcnt(0) and so do
8778 - s_waitcnt vmcnt(0)
8823 order. The s_waitcnt
8824 could be placed after
8828 make the s_waitcnt be
8835 instructions same as
8838 except must generate
8839 all instructions even
8841 store atomic seq_cst - singlethread - global *Same as corresponding
8842 - wavefront - local store atomic release,
8843 - workgroup - generic except must generate
8844 - agent all instructions even
8845 - system for OpenCL.*
8846 atomicrmw seq_cst - singlethread - global *Same as corresponding
8847 - wavefront - local atomicrmw acq_rel,
8848 - workgroup - generic except must generate
8849 - agent all instructions even
8850 - system for OpenCL.*
8851 fence seq_cst - singlethread *none* *Same as corresponding
8852 - wavefront fence acq_rel,
8853 - workgroup except must generate
8854 - agent all instructions even
8855 - system for OpenCL.*
8856 ============ ============ ============== ========== ================================
8858 .. _amdgpu-amdhsa-memory-model-gfx940:
8865 * Each agent has multiple shader arrays (SA).
8866 * Each SA has multiple compute units (CU).
8867 * Each CU has multiple SIMDs that execute wavefronts.
8868 * The wavefronts for a single work-group are executed in the same CU but may be
8869 executed by different SIMDs. The exception is tgsplit execution mode, in
8870 which the wavefronts may be executed by different SIMDs in different CUs.
8871 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
8872 executing on it. The exception is tgsplit execution mode, in which no LDS
8873 is allocated because wavefronts of the same work-group can be in different CUs.
8874 * All LDS operations of a CU are performed as wavefront wide operations in a
8875 global order and involve no caching. Completion is reported to a wavefront in execution order.
8877 * The LDS memory has multiple request queues shared by the SIMDs of a
8878 CU. Therefore, the LDS operations performed by different wavefronts of a
8879 work-group can be reordered relative to each other, which can result in
8880 reordering the visibility of vector memory operations with respect to LDS
8881 operations of other wavefronts in the same work-group. A ``s_waitcnt
8882 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8883 vector memory operations between wavefronts of a work-group, but not between
8884 operations performed by the same wavefront.
8885 * The vector memory operations are performed as wavefront wide operations and
8886 completion is reported to a wavefront in execution order. The exception is
8887 that ``flat_load/store/atomic`` instructions can report out of vector memory
8888 order if they access LDS memory, and out of LDS operation order if they access global memory.
8890 * The vector memory operations access a single vector L1 cache shared by all
8891 SIMDs of a CU. Therefore:
8893 * No special action is required for coherence between the lanes of a single wavefront.
8896 * No special action is required for coherence between wavefronts in the same
8897 work-group since they execute on the same CU. The exception is
8898 tgsplit execution mode, in which wavefronts of the same work-group can be in
8899 different CUs, so a ``buffer_inv sc0`` is required to invalidate the L1 cache.
8902 * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
8903 between wavefronts executing in different work-groups as they may be
8904 executing on different CUs.
8906 * Atomic read-modify-write instructions implicitly bypass the L1 cache.
8907 Therefore, they do not use the sc0 bit for coherence and instead use it to
8908 indicate if the instruction returns the original value being updated. They
8909 do use sc1 to indicate system or agent scope coherence.
8911 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
8912 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
8913 scalar operations are used in a restricted way so do not impact the memory
8914 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
8915 * The vector and scalar memory operations use an L2 cache.
8917 * The gfx940 can be configured as a number of smaller agents with each having
8918 a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
8919 larger agents with groups of CUs on each agent each sharing separate L2 caches.
8921 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
8923 * Each CU has a separate request queue per channel for its associated L2.
8924 Therefore, the vector and scalar memory operations performed by wavefronts
8925 executing with different L1 caches and the same L2 cache can be reordered
8926 relative to each other.
8927 * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
8928 vector memory operations of different CUs. It ensures a previous vector
8929 memory operation has completed before executing a subsequent vector memory
8930 or LDS operation and so can be used to meet the requirements of acquire and release.
8932 * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
8933 (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
8934 the PTE C-bit set for memory not local to the L2.
8936 * Any local memory cache lines will be automatically invalidated by writes
8937 from CUs associated with other L2 caches, or writes from the CPU, due to
8938 the cache probe caused by the PTE C-bit.
8939 * XGMI accesses from the CPU to local memory may be cached on the CPU.
8940 Subsequent access from the GPU will automatically invalidate or writeback
8941 the CPU cache due to the L2 probe filter.
8942 * To ensure coherence of local memory writes of CUs with different L1 caches
8943 in the same agent a ``buffer_wbl2 sc1`` is required. It does nothing if the
8944 agent is configured to have a single L2, or will writeback dirty L2 cache
8945 lines if configured to have multiple L2 caches.
8946 * To ensure coherence of local memory writes of CUs in different agents a
8947 ``buffer_wbl2 sc0 sc1`` is required. It will writeback dirty L2 cache lines.
8948 * To ensure coherence of local memory reads of CUs with different L1 caches
8949 in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
8950 agent is configured to have a single L2, or will invalidate non-local L2
8951 cache lines if configured to have multiple L2 caches.
8952 * To ensure coherence of local memory reads of CUs in different agents a
8953 ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
8954 lines if configured to have multiple L2 caches.
8956 * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
8957 UC (uncached) which bypasses the L2.
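The cache maintenance rules above can be summarised in a small sketch. The Python below is illustrative only: the function names and the ``tgsplit`` parameter are hypothetical, and the instruction strings are the ones the GFX940 code sequence table in this section uses for acquire and release operations.

```python
# Illustrative sketch only; these helpers are hypothetical and are not part of
# LLVM. The instruction strings follow the GFX940 code sequence table.

def acquire_invalidate(scope, tgsplit=False):
    """Return the cache invalidate that ends an acquire at `scope`, or None."""
    if scope in ("singlethread", "wavefront"):
        return None  # lanes of one wavefront need no invalidate
    if scope == "workgroup":
        # Only needed in tgsplit mode, where a work-group may span CUs.
        return "buffer_inv sc0=1" if tgsplit else None
    if scope == "agent":
        return "buffer_inv sc1=1"
    if scope == "system":
        return "buffer_inv sc0=1 sc1=1"
    raise ValueError(f"unknown sync scope: {scope}")


def release_writeback(scope):
    """Return the L2 writeback that starts a release at `scope`, or None."""
    if scope in ("singlethread", "wavefront", "workgroup"):
        return None  # no L2 writeback is listed for these scopes
    if scope == "agent":
        return "buffer_wbl2 sc1=1"
    if scope == "system":
        return "buffer_wbl2 sc0=1 sc1=1"
    raise ValueError(f"unknown sync scope: {scope}")
```

For example, ``acquire_invalidate("workgroup")`` returns ``None`` outside tgsplit mode because the wavefronts of a work-group then share one L1 cache.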
8959 Scalar memory operations are only used to access memory that is proven to not
8960 change during the execution of the kernel dispatch. This includes constant
8961 address space and global address space for program scope ``const`` variables.
8962 Therefore, the kernel machine code does not have to maintain the scalar cache to
8963 ensure it is coherent with the vector caches. The scalar and vector caches are
8964 invalidated between kernel dispatches by CP since constant address space data
8965 may change between kernel dispatch executions. See
8966 :ref:`amdgpu-amdhsa-memory-spaces`.
8968 The one exception is if scalar writes are used to spill SGPR registers. In this
8969 case the AMDGPU backend ensures the memory location used to spill is never
8970 accessed by vector memory operations at the same time. If scalar writes are used
8971 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8972 return since the locations may be used for vector memory instructions by a
8973 future wavefront that uses the same scratch area, or a function call that
8974 creates a frame at the same address, respectively. There is no need for a
8975 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
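A minimal sketch of that placement rule follows. It is not the backend's implementation: the function name is hypothetical, instructions are modelled as plain strings, and ``s_setpc_b64`` is assumed here to stand for the instruction that performs a function return.

```python
# Hypothetical illustration of where s_dcache_wb is placed; not LLVM code.

def insert_scalar_writebacks(instructions, uses_scalar_writes,
                             exits=("s_endpgm", "s_setpc_b64")):
    """Return a copy of `instructions` with s_dcache_wb before each exit.

    `exits` lists the mnemonics treated as program end or function return;
    treating s_setpc_b64 as a return is an assumption of this sketch.
    """
    if not uses_scalar_writes:
        return list(instructions)  # nothing to write back
    result = []
    for inst in instructions:
        parts = inst.split()
        mnemonic = parts[0] if parts else ""
        if mnemonic in exits:
            result.append("s_dcache_wb")
        result.append(inst)
    return result
```

No ``s_dcache_inv`` is ever added by the sketch, matching the write-before-read argument above.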
8977 For kernarg backing memory:
8979 * CP invalidates the L1 cache at the start of each kernel dispatch.
8980 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
8981 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
8982 cache. This also causes it to be treated as non-volatile and so is not
8983 invalidated by ``*_vol``.
8984 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8985 so the L2 cache will be coherent with the CPU and other agents.
8987 Scratch backing memory (which is used for the private address space) is accessed
8988 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
8989 only accessed by a single thread, and is always write-before-read, there is
8990 never a need to invalidate these entries from the L1 cache. Hence all cache
8991 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
8993 The code sequences used to implement the memory model for GFX940 are defined
8994 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
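To show how a row of the table is read, the sketch below spells out the ``store atomic release`` sequences for the global address space. It is a simplified reading of the table, not the backend's code: the function and parameter names are hypothetical, and the table's further refinements (such as when the wait may be split or moved) are not modelled.

```python
# Hypothetical sketch of the "store atomic release" rows for global memory.

def store_atomic_release_global(scope, tgsplit=False, opencl=False):
    """Return the instruction sequence as a list of strings."""
    store = "buffer/global/flat_store"
    if scope in ("singlethread", "wavefront"):
        return [store]
    if scope == "workgroup":
        seq = []
        if tgsplit:
            seq.append("s_waitcnt vmcnt(0)")    # vmcnt(0) in tgsplit mode
        elif not opencl:
            seq.append("s_waitcnt lgkmcnt(0)")  # omitted for OpenCL
        seq.append(store + " sc0=1")
        return seq
    if scope in ("agent", "system"):
        bits = "sc1=1" if scope == "agent" else "sc0=1 sc1=1"
        # In tgsplit mode lgkmcnt(0) is omitted.
        wait = ("s_waitcnt vmcnt(0)" if tgsplit
                else "s_waitcnt lgkmcnt(0) & vmcnt(0)")
        return ["buffer_wbl2 " + bits, wait, store + " " + bits]
    raise ValueError(f"unknown sync scope: {scope}")
```

The writeback comes first, the wait second, and the store last, so that earlier writes are visible at the requested scope before the releasing store is performed.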
8996 .. table:: AMDHSA Memory Model Code Sequences GFX940
8997 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
8999 ============ ============ ============== ========== ================================
9000 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
9001 Ordering Sync Scope Address GFX940
9003 ============ ============ ============== ========== ================================
9005 ------------------------------------------------------------------------------------
9006 load *none* *none* - global - !volatile & !nontemporal
9008 - private 1. buffer/global/flat_load
9010 - !volatile & nontemporal
9012 1. buffer/global/flat_load
9017 1. buffer/global/flat_load
9019 2. s_waitcnt vmcnt(0)
9021 - Must happen before
9022 any following volatile
9033 load *none* *none* - local 1. ds_load
9034 store *none* *none* - global - !volatile & !nontemporal
9036 - private 1. buffer/global/flat_store
9038 - !volatile & nontemporal
9040 1. buffer/global/flat_store
9045 1. buffer/global/flat_store
9047 2. s_waitcnt vmcnt(0)
9049 - Must happen before
9050 any following volatile
9061 store *none* *none* - local 1. ds_store
9062 **Unordered Atomic**
9063 ------------------------------------------------------------------------------------
9064 load atomic unordered *any* *any* *Same as non-atomic*.
9065 store atomic unordered *any* *any* *Same as non-atomic*.
9066 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
9067 **Monotonic Atomic**
9068 ------------------------------------------------------------------------------------
9069 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
9070 - wavefront - generic
9071 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
9073 load atomic monotonic - singlethread - local *If TgSplit execution mode,
9074 - wavefront local address space cannot
9075 - workgroup be used.*
9078 load atomic monotonic - agent - global 1. buffer/global/flat_load
9080 load atomic monotonic - system - global 1. buffer/global/flat_load
9081 - generic sc0=1 sc1=1
9082 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
9083 - wavefront - generic
9084 store atomic monotonic - workgroup - global 1. buffer/global/flat_store
9086 store atomic monotonic - agent - global 1. buffer/global/flat_store
9088 store atomic monotonic - system - global 1. buffer/global/flat_store
9089 - generic sc0=1 sc1=1
9090 store atomic monotonic - singlethread - local *If TgSplit execution mode,
9091 - wavefront local address space cannot
9092 - workgroup be used.*
9095 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
9096 - wavefront - generic
9099 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
9101 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
9102 - wavefront local address space cannot
9103 - workgroup be used.*
9107 ------------------------------------------------------------------------------------
9108 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
9111 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1
9112 2. s_waitcnt vmcnt(0)
9114 - If not TgSplit execution
9116 - Must happen before the
9117 following buffer_inv.
9121 - If not TgSplit execution
9123 - Must happen before
9134 load atomic acquire - workgroup - local *If TgSplit execution mode,
9135 local address space cannot
9139 2. s_waitcnt lgkmcnt(0)
9142 - Must happen before
9151 older than the local load
9155 load atomic acquire - workgroup - generic 1. flat_load sc0=1
9156 2. s_waitcnt lgkm/vmcnt(0)
9158 - Use lgkmcnt(0) if not
9159 TgSplit execution mode
9160 and vmcnt(0) if TgSplit
9162 - If OpenCL, omit lgkmcnt(0).
9163 - Must happen before
9166 following global/generic
9173 older than a local load
9179 - If not TgSplit execution
9186 load atomic acquire - agent - global 1. buffer/global_load
9188 2. s_waitcnt vmcnt(0)
9190 - Must happen before
9200 - Must happen before
9210 load atomic acquire - system - global 1. buffer/global/flat_load
9212 2. s_waitcnt vmcnt(0)
9214 - Must happen before
9222 3. buffer_inv sc0=1 sc1=1
9224 - Must happen before
9232 stale MTYPE NC global data.
9233 MTYPE RW and CC memory will
9234 never be stale due to the
9237 load atomic acquire - agent - generic 1. flat_load sc1=1
9238 2. s_waitcnt vmcnt(0) &
9241 - If TgSplit execution mode,
9245 - Must happen before
9248 - Ensures the flat_load
9255 - Must happen before
9265 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1
9266 2. s_waitcnt vmcnt(0) &
9269 - If TgSplit execution mode,
9273 - Must happen before
9276 - Ensures the flat_load
9281 3. buffer_inv sc0=1 sc1=1
9283 - Must happen before
9291 stale MTYPE NC global data.
9292 MTYPE RW and CC memory will
9293 never be stale due to the
9296 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
9297 - wavefront - generic
9298 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
9299 - wavefront local address space cannot
9303 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
9304 2. s_waitcnt vmcnt(0)
9306 - If not TgSplit execution
9308 - Must happen before the
9309 following buffer_inv.
9310 - Ensures the atomicrmw
9317 - If not TgSplit execution
9319 - Must happen before
9329 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
9330 local address space cannot
9334 2. s_waitcnt lgkmcnt(0)
9337 - Must happen before
9346 older than the local
9350 atomicrmw acquire - workgroup - generic 1. flat_atomic
9351 2. s_waitcnt lgkm/vmcnt(0)
9353 - Use lgkmcnt(0) if not
9354 TgSplit execution mode
9355 and vmcnt(0) if TgSplit
9357 - If OpenCL, omit lgkmcnt(0).
9358 - Must happen before
9375 - If not TgSplit execution
9382 atomicrmw acquire - agent - global 1. buffer/global_atomic
9383 2. s_waitcnt vmcnt(0)
9385 - Must happen before
9396 - Must happen before
9406 atomicrmw acquire - system - global 1. buffer/global_atomic
9408 2. s_waitcnt vmcnt(0)
9410 - Must happen before
9419 3. buffer_inv sc0=1 sc1=1
9421 - Must happen before
9429 stale MTYPE NC global data.
9430 MTYPE RW and CC memory will
9431 never be stale due to the
9434 atomicrmw acquire - agent - generic 1. flat_atomic
9435 2. s_waitcnt vmcnt(0) &
9438 - If TgSplit execution mode,
9442 - Must happen before
9453 - Must happen before
9463 atomicrmw acquire - system - generic 1. flat_atomic sc1=1
9464 2. s_waitcnt vmcnt(0) &
9467 - If TgSplit execution mode,
9471 - Must happen before
9480 3. buffer_inv sc0=1 sc1=1
9482 - Must happen before
9490 stale MTYPE NC global data.
9491 MTYPE RW and CC memory will
9492 never be stale due to the
9495 fence acquire - singlethread *none* *none*
9497 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
9499 - Use lgkmcnt(0) if not
9500 TgSplit execution mode
9501 and vmcnt(0) if TgSplit
9511 - However, since LLVM
9526 - s_waitcnt vmcnt(0)
9538 fence-paired-atomic).
9539 - s_waitcnt lgkmcnt(0)
9550 fence-paired-atomic).
9551 - Must happen before
9564 fence-paired-atomic.
9568 - If not TgSplit execution
9575 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
9578 - If TgSplit execution mode,
9584 - However, since LLVM
9592 - Could be split into
9601 - s_waitcnt vmcnt(0)
9612 fence-paired-atomic).
9613 - s_waitcnt lgkmcnt(0)
9624 fence-paired-atomic).
9625 - Must happen before
9639 fence-paired-atomic.
9643 - Must happen before any
9644 following global/generic
9653 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
9656 - If TgSplit execution mode,
9662 - However, since LLVM
9670 - Could be split into
9679 - s_waitcnt vmcnt(0)
9690 fence-paired-atomic).
9691 - s_waitcnt lgkmcnt(0)
9702 fence-paired-atomic).
9703 - Must happen before
9717 fence-paired-atomic.
9719 2. buffer_inv sc0=1 sc1=1
9721 - Must happen before any
9722 following global/generic
9732 ------------------------------------------------------------------------------------
9733 store atomic release - singlethread - global 1. buffer/global/flat_store
9734 - wavefront - generic
9735 store atomic release - singlethread - local *If TgSplit execution mode,
9736 - wavefront local address space cannot
9740 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
9742 - Use lgkmcnt(0) if not
9743 TgSplit execution mode
9744 and vmcnt(0) if TgSplit
9746 - If OpenCL, omit lgkmcnt(0).
9747 - s_waitcnt vmcnt(0)
9750 global/generic load/store/
9751 load atomic/store atomic/
9753 - s_waitcnt lgkmcnt(0)
9760 - Must happen before
9771 2. buffer/global/flat_store sc0=1
9772 store atomic release - workgroup - local *If TgSplit execution mode,
9773 local address space cannot
9777 store atomic release - agent - global 1. buffer_wbl2 sc1=1
9779 - Must happen before
9780 following s_waitcnt.
9781 - Performs L2 writeback to
9785 visible at agent scope.
9787 2. s_waitcnt lgkmcnt(0) &
9790 - If TgSplit execution mode,
9796 - Could be split into
9805 - s_waitcnt vmcnt(0)
9812 - s_waitcnt lgkmcnt(0)
9819 - Must happen before
9830 3. buffer/global/flat_store sc1=1
9831 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1
9833 - Must happen before
9834 following s_waitcnt.
9835 - Performs L2 writeback to
9839 visible at system scope.
9841 2. s_waitcnt lgkmcnt(0) &
9844 - If TgSplit execution mode,
9850 - Could be split into
9859 - s_waitcnt vmcnt(0)
9860 must happen after any
9866 - s_waitcnt lgkmcnt(0)
9867 must happen after any
9873 - Must happen before
9878 to memory and the L2
9885 3. buffer/global/flat_store
9887 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
9888 - wavefront - generic
9889 atomicrmw release - singlethread - local *If TgSplit execution mode,
9890 - wavefront local address space cannot
9894 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
9896 - Use lgkmcnt(0) if not
9897 TgSplit execution mode
9898 and vmcnt(0) if TgSplit
9902 - s_waitcnt vmcnt(0)
9905 global/generic load/store/
9906 load atomic/store atomic/
9908 - s_waitcnt lgkmcnt(0)
9915 - Must happen before
9926 2. buffer/global/flat_atomic sc0=1
9927 atomicrmw release - workgroup - local *If TgSplit execution mode,
9928 local address space cannot
9932 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1
9934 - Must happen before
9935 following s_waitcnt.
9936 - Performs L2 writeback to
9940 visible at agent scope.
9942 2. s_waitcnt lgkmcnt(0) &
9945 - If TgSplit execution mode,
9949 - Could be split into
9958 - s_waitcnt vmcnt(0)
9965 - s_waitcnt lgkmcnt(0)
9972 - Must happen before
9983 3. buffer/global/flat_atomic sc1=1
9984 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1
9986 - Must happen before
9987 following s_waitcnt.
9988 - Performs L2 writeback to
9992 visible at system scope.
9994 2. s_waitcnt lgkmcnt(0) &
9997 - If TgSplit execution mode,
10001 - Could be split into
10005 lgkmcnt(0) to allow
10007 independently moved
10010 - s_waitcnt vmcnt(0)
10017 - s_waitcnt lgkmcnt(0)
10024 - Must happen before
10029 to memory and the L2
10033 store that is being
10036 3. buffer/global/flat_atomic
10038 fence release - singlethread *none* *none*
10040 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10042 - Use lgkmcnt(0) if not
10043 TgSplit execution mode
10044 and vmcnt(0) if TgSplit
10054 - However, since LLVM
10059 always generate. If
10069 - s_waitcnt vmcnt(0)
10074 load atomic/store atomic/
10076 - s_waitcnt lgkmcnt(0)
10083 - Must happen before
10084 any following store
10088 and memory ordering
10092 fence-paired-atomic).
10099 fence-paired-atomic.
10101 fence release - agent *none* 1. buffer_wbl2 sc1=1
10106 - Must happen before
10107 following s_waitcnt.
10108 - Performs L2 writeback to
10111 store/atomicrmw are
10112 visible at agent scope.
10114 2. s_waitcnt lgkmcnt(0) &
10117 - If TgSplit execution mode,
10127 - However, since LLVM
10132 always generate. If
10142 - Could be split into
10146 lgkmcnt(0) to allow
10148 independently moved
10151 - s_waitcnt vmcnt(0)
10158 - s_waitcnt lgkmcnt(0)
10165 - Must happen before
10166 any following store
10170 and memory ordering
10174 fence-paired-atomic).
10181 fence-paired-atomic.
10183 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10185 - Must happen before
10186 following s_waitcnt.
10187 - Performs L2 writeback to
10190 store/atomicrmw are
10191 visible at system scope.
10193 2. s_waitcnt lgkmcnt(0) &
10196 - If TgSplit execution mode,
10206 - However, since LLVM
10211 always generate. If
10221 - Could be split into
10225 lgkmcnt(0) to allow
10227 independently moved
10230 - s_waitcnt vmcnt(0)
10237 - s_waitcnt lgkmcnt(0)
10244 - Must happen before
10245 any following store
10249 and memory ordering
10253 fence-paired-atomic).
10260 fence-paired-atomic.
10262 **Acquire-Release Atomic**
10263 ------------------------------------------------------------------------------------
10264 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
10265 - wavefront - generic
10266 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
10267 - wavefront local address space cannot
10271 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10273 - Use lgkmcnt(0) if not
10274 TgSplit execution mode
10275 and vmcnt(0) if TgSplit
10279 - Must happen after
10285 - s_waitcnt vmcnt(0)
10288 global/generic load/store/
10289 load atomic/store atomic/
10291 - s_waitcnt lgkmcnt(0)
10298 - Must happen before
10309 2. buffer/global_atomic
10310 3. s_waitcnt vmcnt(0)
10312 - If not TgSplit execution
10314 - Must happen before
10324 4. buffer_inv sc0=1
10326 - If not TgSplit execution
10333 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
10334 local address space cannot
10338 2. s_waitcnt lgkmcnt(0)
10341 - Must happen before
10350 older than the local load
10354 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
10356 - Use lgkmcnt(0) if not
10357 TgSplit execution mode
10358 and vmcnt(0) if TgSplit
10362 - s_waitcnt vmcnt(0)
10365 global/generic load/store/
10366 load atomic/store atomic/
10368 - s_waitcnt lgkmcnt(0)
10375 - Must happen before
10387 3. s_waitcnt lgkmcnt(0) &
10390 - If not TgSplit execution
10391 mode, omit vmcnt(0).
10394 - Must happen before
10405 older than a local load
10409 4. buffer_inv sc0=1
10411 - If not TgSplit execution
10418 atomicrmw acq_rel - agent - global 1. buffer_wbl2 sc1=1
10420 - Must happen before
10421 following s_waitcnt.
10422 - Performs L2 writeback to
10425 store/atomicrmw are
10426 visible at agent scope.
10428 2. s_waitcnt lgkmcnt(0) &
10431 - If TgSplit execution mode,
10435 - Could be split into
10439 lgkmcnt(0) to allow
10441 independently moved
10444 - s_waitcnt vmcnt(0)
10451 - s_waitcnt lgkmcnt(0)
10458 - Must happen before
10469 3. buffer/global_atomic
10470 4. s_waitcnt vmcnt(0)
10472 - Must happen before
10481 5. buffer_inv sc1=1
10483 - Must happen before
10493 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1
10495 - Must happen before
10496 following s_waitcnt.
10497 - Performs L2 writeback to
10500 store/atomicrmw are
10501 visible at system scope.
10503 2. s_waitcnt lgkmcnt(0) &
10506 - If TgSplit execution mode,
10510 - Could be split into
10514 lgkmcnt(0) to allow
10516 independently moved
10519 - s_waitcnt vmcnt(0)
10526 - s_waitcnt lgkmcnt(0)
10533 - Must happen before
10538 to global and L2 writeback
10539 have completed before
10544 3. buffer/global_atomic
10546 4. s_waitcnt vmcnt(0)
10548 - Must happen before
10557 5. buffer_inv sc0=1 sc1=1
10559 - Must happen before
10567 MTYPE NC global data.
10568 MTYPE RW and CC memory will
10569 never be stale due to the
10572 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1
10574 - Must happen before
10575 following s_waitcnt.
10576 - Performs L2 writeback to
10579 store/atomicrmw are
10580 visible at agent scope.
10582 2. s_waitcnt lgkmcnt(0) &
10585 - If TgSplit execution mode,
10589 - Could be split into
10593 lgkmcnt(0) to allow
10595 independently moved
10598 - s_waitcnt vmcnt(0)
10605 - s_waitcnt lgkmcnt(0)
10612 - Must happen before
10624 4. s_waitcnt vmcnt(0) &
10627 - If TgSplit execution mode,
10631 - Must happen before
10640 5. buffer_inv sc1=1
10642 - Must happen before
10652 atomicrmw acq_rel - system - generic 1. buffer_wbl2 sc0=1 sc1=1
10654 - Must happen before
10655 following s_waitcnt.
10656 - Performs L2 writeback to
10659 store/atomicrmw are
10660 visible at system scope.
10662 2. s_waitcnt lgkmcnt(0) &
10665 - If TgSplit execution mode,
10669 - Could be split into
10673 lgkmcnt(0) to allow
10675 independently moved
10678 - s_waitcnt vmcnt(0)
10685 - s_waitcnt lgkmcnt(0)
10692 - Must happen before
10697 to global and L2 writeback
10698 have completed before
10703 3. flat_atomic sc1=1
10704 4. s_waitcnt vmcnt(0) &
10707 - If TgSplit execution mode,
10711 - Must happen before
10720 5. buffer_inv sc0=1 sc1=1
10722 - Must happen before
10730 MTYPE NC global data.
10731 MTYPE RW and CC memory will
10732 never be stale due to the
10735 fence acq_rel - singlethread *none* *none*
10737 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10739 - Use lgkmcnt(0) if not
10740 TgSplit execution mode
10741 and vmcnt(0) if TgSplit
10760 - s_waitcnt vmcnt(0)
10765 load atomic/store atomic/
10767 - s_waitcnt lgkmcnt(0)
10774 - Must happen before
10793 and memory ordering
10797 acquire-fence-paired-atomic)
10810 local/generic store
10814 and memory ordering
10818 release-fence-paired-atomic).
10822 - Must happen before
10826 acquire-fence-paired
10827 atomic has completed
10828 before invalidating
10832 locations read must
10836 acquire-fence-paired-atomic.
10838 3. buffer_inv sc0=1
10840 - If not TgSplit execution
10847 fence acq_rel - agent *none* 1. buffer_wbl2 sc1=1
10852 - Must happen before
10853 following s_waitcnt.
10854 - Performs L2 writeback to
10857 store/atomicrmw are
10858 visible at agent scope.
10860 2. s_waitcnt lgkmcnt(0) &
10863 - If TgSplit execution mode,
10869 - However, since LLVM
10877 - Could be split into
10881 lgkmcnt(0) to allow
10883 independently moved
10886 - s_waitcnt vmcnt(0)
10893 - s_waitcnt lgkmcnt(0)
10900 - Must happen before
10905 global/local/generic
10910 and memory ordering
10914 acquire-fence-paired-atomic)
10916 before invalidating
10926 global/local/generic
10931 and memory ordering
10935 release-fence-paired-atomic).
10940 3. buffer_inv sc1=1
10942 - Must happen before
10956 fence acq_rel - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10961 - Must happen before
10962 following s_waitcnt.
10963 - Performs L2 writeback to
10966 store/atomicrmw are
10967 visible at system scope.
10969 2. s_waitcnt lgkmcnt(0) &
10972 - If TgSplit execution mode,
10978 - However, since LLVM
10986 - Could be split into
10990 lgkmcnt(0) to allow
10992 independently moved
10995 - s_waitcnt vmcnt(0)
11002 - s_waitcnt lgkmcnt(0)
11009 - Must happen before
11014 global/local/generic
11019 and memory ordering
11023 acquire-fence-paired-atomic)
11025 before invalidating
11035 global/local/generic
11040 and memory ordering
11044 release-fence-paired-atomic).
11049 3. buffer_inv sc0=1 sc1=1
11051 - Must happen before
11060 MTYPE NC global data.
11061 MTYPE RW and CC memory will
11062 never be stale due to the
11065 **Sequentially Consistent Atomic**
11066 ------------------------------------------------------------------------------------
11067 load atomic seq_cst - singlethread - global *Same as corresponding
11068 - wavefront - local load atomic acquire,
11069 - generic except must generate
11070 all instructions even
11072 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
11074 - Use lgkmcnt(0) if not
11075 TgSplit execution mode
11076 and vmcnt(0) if TgSplit
11078 - s_waitcnt lgkmcnt(0) must
11085 ordering of seq_cst
11091 lgkmcnt(0) and so do
11094 - s_waitcnt vmcnt(0)
11097 global/generic load
11101 ordering of seq_cst
11113 consistent global/local
11114 memory instructions
11120 prevents reordering
11123 seq_cst load. (Note
11129 followed by a store
11136 release followed by
11139 order. The s_waitcnt
11140 could be placed after
11141 seq_store or before
11144 make the s_waitcnt be
11145 as late as possible
11151 instructions same as
11154 except must generate
11155 all instructions even
11157 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
11158 local address space cannot
11161 *Same as corresponding
11162 load atomic acquire,
11163 except must generate
11164 all instructions even
11167 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
11168 - system - generic vmcnt(0)
11170 - If TgSplit execution mode,
11172 - Could be split into
11176 lgkmcnt(0) to allow
11178 independently moved
11181 - s_waitcnt lgkmcnt(0)
11184 global/generic load
11188 ordering of seq_cst
11194 lgkmcnt(0) and so do
11197 - s_waitcnt vmcnt(0)
11200 global/generic load
11204 ordering of seq_cst
11217 memory instructions
11223 prevents reordering
11226 seq_cst load. (Note
11232 followed by a store
11239 release followed by
11242 order. The s_waitcnt
11243 could be placed after
11244 seq_store or before
11247 make the s_waitcnt be
11248 as late as possible
11254 instructions same as
11257 except must generate
11258 all instructions even
11260 store atomic seq_cst - singlethread - global *Same as corresponding
11261 - wavefront - local store atomic release,
11262 - workgroup - generic except must generate
11263 - agent all instructions even
11264 - system for OpenCL.*
11265 atomicrmw seq_cst - singlethread - global *Same as corresponding
11266 - wavefront - local atomicrmw acq_rel,
11267 - workgroup - generic except must generate
11268 - agent all instructions even
11269 - system for OpenCL.*
11270 fence seq_cst - singlethread *none* *Same as corresponding
11271 - wavefront fence acq_rel,
11272 - workgroup except must generate
11273 - agent all instructions even
11274 - system for OpenCL.*
11275 ============ ============ ============== ========== ================================
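The sequentially consistent rows at the end of the table mostly defer to other rows. As a compact restatement, the hypothetical lookup below (the names are invented for this example) records which ordering each ``seq_cst`` operation borrows its code sequence from. It does not model the extra leading ``s_waitcnt`` that the table requires for ``load atomic seq_cst`` at workgroup, agent and system scope, nor the requirement that all instructions be generated even for OpenCL.

```python
# Hypothetical lookup; it restates the "Same as corresponding ..." table rows.
SEQ_CST_EQUIVALENT = {
    "load atomic": "acquire",
    "store atomic": "release",
    "atomicrmw": "acq_rel",
    "fence": "acq_rel",
}


def seq_cst_base_ordering(instr):
    """Return the ordering whose code sequence a seq_cst `instr` reuses."""
    try:
        return SEQ_CST_EQUIVALENT[instr]
    except KeyError:
        raise ValueError(f"unknown LLVM instruction: {instr}") from None
```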
11277 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11279 Memory Model GFX10-GFX11
11280 ++++++++++++++++++++++++
11284 * Each agent has multiple shader arrays (SA).
11285 * Each SA has multiple work-group processors (WGP).
11286 * Each WGP has multiple compute units (CU).
11287 * Each CU has multiple SIMDs that execute wavefronts.
11288 * The wavefronts for a single work-group are executed in the same
11289 WGP. In CU wavefront execution mode the wavefronts may be executed by
11290 different SIMDs in the same CU. In WGP wavefront execution mode the
11291 wavefronts may be executed by different SIMDs in different CUs in the same WGP.
11293 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups executing on it.
11295 * All LDS operations of a WGP are performed as wavefront wide operations in a
11296 global order and involve no caching. Completion is reported to a wavefront in execution order.
11298 * The LDS memory has multiple request queues shared by the SIMDs of a
11299 WGP. Therefore, the LDS operations performed by different wavefronts of a
11300 work-group can be reordered relative to each other, which can result in
11301 reordering the visibility of vector memory operations with respect to LDS
11302 operations of other wavefronts in the same work-group. A ``s_waitcnt
11303 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11304 vector memory operations between wavefronts of a work-group, but not between
11305 operations performed by the same wavefront.
11306 * The vector memory operations are performed as wavefront wide operations.
11307 Completion of load/store/sample operations is reported to a wavefront in
11308 execution order of other load/store/sample operations performed by that wavefront.
11310 * The vector memory operations access a vector L0 cache. There is a single L0
11311 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11312 special action is required for coherence between the lanes of a single
11313 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11314 wavefronts executing in the same work-group as they may be executing on SIMDs
11315 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11316 required for coherence between wavefronts executing in different work-groups
11317 as they may be executing on different WGPs.
11318 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
11319 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11320 operations are used in a restricted way so do not impact the memory model. See
11321 :ref:`amdgpu-amdhsa-memory-spaces`.
11322 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
11323 the same SA. Therefore, no special action is required for coherence between
11324 the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
11325 required for coherence between wavefronts executing in different work-groups
11326 as they may be executing on different SAs that access different L1s.
11327 * The L1 caches have independent quadrants to service disjoint ranges of virtual addresses.
11329 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
11330 vector and scalar memory operations performed by different wavefronts, whether
11331 executing in the same or different work-groups (which may be executing on
11332 different CUs accessing different L0s), can be reordered relative to each
11333 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
11334 synchronization between vector memory operations of different wavefronts. It
11335 ensures a previous vector memory operation has completed before executing a
11336 subsequent vector memory or LDS operation and so can be used to meet the
11337 requirements of acquire, release and sequential consistency.
11338 * The L1 caches use an L2 cache shared by all SAs on the same agent.
11339 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
11341 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
11342 quadrant has a separate request queue per L2 channel. Therefore, the vector
11343 and scalar memory operations performed by wavefronts executing in different
11344 work-groups (which may be executing on different SAs) of an agent can be
11345 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
11346 required to ensure synchronization between vector memory operations of
11347 different SAs. It ensures a previous vector memory operation has completed
11348 before executing a subsequent vector memory operation and so can be used to meet the
11349 requirements of acquire, release and sequential consistency.
11350 * The L2 cache can be kept coherent with other agents on some targets, or ranges
11351 of virtual addresses can be set up to bypass it to ensure system coherence.
11352 * On GFX10.3 and GFX11 a memory attached last level (MALL) cache exists for GPU memory.
11353 The MALL cache is fully coherent with GPU memory and has no impact on system
11354 coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
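
The cache hierarchy described above can be summarized with a small Python
sketch. It is only a simplified model of the bullets, not the backend's
implementation: the function name ``required_invalidates`` and its parameters
are hypothetical, the order of the returned instructions is not modeled, and
L2 coherence with other agents is left out.

.. code-block:: python

  def required_invalidates(scope, wgp_mode):
      """Cache invalidates an acquire at `scope` needs, per the bullets above.

      scope    -- LLVM sync scope name: "singlethread", "wavefront",
                  "workgroup", "agent" or "system"
      wgp_mode -- True for WGP wavefront execution mode, False for CU mode
      """
      if scope in ("singlethread", "wavefront"):
          # Lanes of one wavefront share a single L0: nothing to invalidate.
          return []
      if scope == "workgroup":
          # In WGP mode the work-group may span both CUs, each with its own
          # L0. In CU mode every wavefront of the work-group uses the same L0.
          return ["buffer_gl0_inv"] if wgp_mode else []
      if scope in ("agent", "system"):
          # Other work-groups may run on other WGPs (different L0s) and on
          # other SAs (different L1s).
          return ["buffer_gl0_inv", "buffer_gl1_inv"]
      raise ValueError(f"unknown sync scope: {scope!r}")


  for scope in ("wavefront", "workgroup", "agent"):
      print(scope, required_invalidates(scope, wgp_mode=True))

The code sequence table for GFX10-GFX11 remains the reference for the exact
instructions and their placement.
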
11356 Scalar memory operations are only used to access memory that is proven to not
11357 change during the execution of the kernel dispatch. This includes constant
11358 address space and global address space for program scope ``const`` variables.
11359 Therefore, the kernel machine code does not have to maintain the scalar cache to
11360 ensure it is coherent with the vector caches. The scalar and vector caches are
11361 invalidated between kernel dispatches by CP since constant address space data
11362 may change between kernel dispatch executions. See
11363 :ref:`amdgpu-amdhsa-memory-spaces`.
11365 The one exception is if scalar writes are used to spill SGPR registers. In this
11366 case the AMDGPU backend ensures the memory location used to spill is never
11367 accessed by vector memory operations at the same time. If scalar writes are used
11368 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
11369 return since the locations may be used for vector memory instructions by a
11370 future wavefront that uses the same scratch area, or a function call that
11371 creates a frame at the same address, respectively. There is no need for a
11372 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
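
The write-back rule for SGPR spills can be illustrated with the following
Python sketch, which operates on a list of instruction strings. It is not how
the backend is implemented. The function ``insert_dcache_wb`` is hypothetical,
and treating ``s_setpc_b64`` as the function return is an assumption made only
for this example.

.. code-block:: python

  def insert_dcache_wb(instrs, uses_scalar_writes,
                       exit_mnemonics=("s_endpgm", "s_setpc_b64")):
      """Return a copy of `instrs` with s_dcache_wb before each exit point.

      instrs             -- list of instruction strings, mnemonic first
      uses_scalar_writes -- True if scalar writes were used to spill SGPRs
      exit_mnemonics     -- mnemonics treated as program end or function
                            return (assumption: s_setpc_b64 is the return)
      """
      if not uses_scalar_writes:
          # No scalar writes, so there is nothing to write back.
          return list(instrs)
      out = []
      for ins in instrs:
          parts = ins.split()
          if parts and parts[0] in exit_mnemonics:
              # Flush the scalar cache so a later user of the same scratch
              # area sees the data through vector memory operations.
              out.append("s_dcache_wb")
          out.append(ins)
      return out


  print(insert_dcache_wb(["s_nop 0", "s_endpgm"], uses_scalar_writes=True))

No ``s_dcache_inv`` is inserted, matching the write-before-read argument in the
paragraph above.
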
11374 For kernarg backing memory:
11376 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
11377 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
11378 needing to invalidate the L2 cache.
11379 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
11380 so the L2 cache will be coherent with the CPU and other agents.
11382 Scratch backing memory (which is used for the private address space) is accessed
11383 with MTYPE NC (non-coherent). Since the private address space is only accessed
11384 by a single thread, and is always write-before-read, there is never a need to
11385 invalidate these entries from the L0 or L1 caches.
11387 Wavefronts are executed in native mode with in-order reporting of loads and
11388 sample instructions. In this mode vmcnt reports completion of load, atomic with
11389 return and sample instructions in order, and the vscnt reports the completion of
11390 store and atomic without return in order. See ``MEM_ORDERED`` field in
11391 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
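
As a minimal sketch of this counter split, the Python below maps an operation
kind to the counter that reports its completion. The operation labels and the
function names are hypothetical, and the returned string uses the notation of
the code sequence table rather than assembler syntax.

.. code-block:: python

  # Operation kinds named in the paragraph above; the string labels are
  # chosen only for this sketch.
  VMCNT_OPS = frozenset({"load", "atomic_with_return", "sample"})
  VSCNT_OPS = frozenset({"store", "atomic_without_return"})


  def completion_counter(op_kind):
      """Return the counter that reports completion of `op_kind`."""
      if op_kind in VMCNT_OPS:
          return "vmcnt"
      if op_kind in VSCNT_OPS:
          return "vscnt"
      raise ValueError(f"unknown operation kind: {op_kind!r}")


  def atomic_wait(returns_value):
      """Wait written as 'vm/vscnt(0)' in the table, resolved for one atomic."""
      kind = "atomic_with_return" if returns_value else "atomic_without_return"
      return f"s_waitcnt {completion_counter(kind)}(0)"


  print(atomic_wait(returns_value=True))
  print(atomic_wait(returns_value=False))

This is why table entries written as ``s_waitcnt vm/vscnt(0)`` depend on
whether the atomic returns a value.
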
11393 Wavefronts can be executed in WGP or CU wavefront execution mode:
11395 * In WGP wavefront execution mode the wavefronts of a work-group are executed
11396 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the
11397 per-CU L0 caches is required for work-group synchronization. Also, accesses to L1
11398 at work-group scope need to be explicitly ordered as the accesses from
11399 different CUs are not ordered.
11400 * In CU wavefront execution mode the wavefronts of a work-group are executed on
11401 the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
11402 the work-group use the same L0, which in turn ensures L1 accesses are
11403 ordered, so no explicit management of the caches is required for
11404 work-group synchronization.
11406 See ``WGP_MODE`` field in
11407 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
11408 :ref:`amdgpu-target-features`.
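
The effect of the execution mode on the wait emitted before a work-group scope
release ``atomicrmw`` on global or generic memory can be sketched as follows.
This is an illustrative model of that one table row under stated assumptions:
the function name and parameters are hypothetical, and the output uses the
table's ``&`` notation, not assembler syntax.

.. code-block:: python

  def workgroup_release_waitcnt(wgp_mode, opencl=False):
      """Wait before a workgroup-scope release atomicrmw (global/generic).

      wgp_mode -- True for WGP wavefront execution mode, False for CU mode
      opencl   -- True if compiling OpenCL, where lgkmcnt(0) is omitted
      Returns the wait in table notation, or None if nothing is needed.
      """
      counters = []
      if not opencl:
          counters.append("lgkmcnt(0)")
      if wgp_mode:
          # Wavefronts may run on both CUs of the WGP, so outstanding vector
          # memory operations must complete before the release.
          counters += ["vmcnt(0)", "vscnt(0)"]
      if not counters:
          return None
      return "s_waitcnt " + " & ".join(counters)


  print(workgroup_release_waitcnt(wgp_mode=True))
  print(workgroup_release_waitcnt(wgp_mode=False))

In CU mode the vector memory counters drop out because every wavefront of the
work-group goes through the same L0.
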
11410 The code sequences used to implement the memory model for GFX10-GFX11 are defined in
11411 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
11413 .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11414 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11416 ============ ============ ============== ========== ================================
11417 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
11418 Ordering Sync Scope Address GFX10-GFX11
11420 ============ ============ ============== ========== ================================
11422 ------------------------------------------------------------------------------------
11423 load *none* *none* - global - !volatile & !nontemporal
11425 - private 1. buffer/global/flat_load
11427 - !volatile & nontemporal
11429 1. buffer/global/flat_load
11432 - If GFX10, omit dlc=1.
11436 1. buffer/global/flat_load
11439 2. s_waitcnt vmcnt(0)
11441 - Must happen before
11442 any following volatile
11453 load *none* *none* - local 1. ds_load
11454 store *none* *none* - global - !volatile & !nontemporal
11456 - private 1. buffer/global/flat_store
11458 - !volatile & nontemporal
11460 1. buffer/global/flat_store
11463 - If GFX10, omit dlc=1.
11467 1. buffer/global/flat_store
11470 - If GFX10, omit dlc=1.
11472 2. s_waitcnt vscnt(0)
11474 - Must happen before
11475 any following volatile
11486 store *none* *none* - local 1. ds_store
11487 **Unordered Atomic**
11488 ------------------------------------------------------------------------------------
11489 load atomic unordered *any* *any* *Same as non-atomic*.
11490 store atomic unordered *any* *any* *Same as non-atomic*.
11491 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
11492 **Monotonic Atomic**
11493 ------------------------------------------------------------------------------------
11494 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
11495 - wavefront - generic
11496 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
11499 - If CU wavefront execution
11502 load atomic monotonic - singlethread - local 1. ds_load
11505 load atomic monotonic - agent - global 1. buffer/global/flat_load
11506 - system - generic glc=1 dlc=1
11508 - If GFX11, omit dlc=1.
11510 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
11511 - wavefront - generic
11515 store atomic monotonic - singlethread - local 1. ds_store
11518 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
11519 - wavefront - generic
11523 atomicrmw monotonic - singlethread - local 1. ds_atomic
11527 ------------------------------------------------------------------------------------
11528 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
11529 - wavefront - local
11531 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
11533 - If CU wavefront execution
11536 2. s_waitcnt vmcnt(0)
11538 - If CU wavefront execution
11540 - Must happen before
11541 the following buffer_gl0_inv
11542 and before any following
11550 - If CU wavefront execution
11557 load atomic acquire - workgroup - local 1. ds_load
11558 2. s_waitcnt lgkmcnt(0)
11561 - Must happen before
11562 the following buffer_gl0_inv
11563 and before any following
11564 global/generic load/load
11570 older than the local load
11576 - If CU wavefront execution
11584 load atomic acquire - workgroup - generic 1. flat_load glc=1
11586 - If CU wavefront execution
11589 2. s_waitcnt lgkmcnt(0) &
11592 - If CU wavefront execution
11593 mode, omit vmcnt(0).
11596 - Must happen before
11598 buffer_gl0_inv and any
11599 following global/generic
11606 older than a local load
11612 - If CU wavefront execution
11619 load atomic acquire - agent - global 1. buffer/global_load
11620 - system glc=1 dlc=1
11622 - If GFX11, omit dlc=1.
11624 2. s_waitcnt vmcnt(0)
11626 - Must happen before
11631 before invalidating
11637 - Must happen before
11647 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
11649 - If GFX11, omit dlc=1.
11651 2. s_waitcnt vmcnt(0) &
11656 - Must happen before
11659 - Ensures the flat_load
11661 before invalidating
11667 - Must happen before
11677 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
11678 - wavefront - local
11680 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
11681 2. s_waitcnt vm/vscnt(0)
11683 - If CU wavefront execution
11685 - Use vmcnt(0) if atomic with
11686 return and vscnt(0) if
11687 atomic with no-return.
11688 - Must happen before
11689 the following buffer_gl0_inv
11690 and before any following
11698 - If CU wavefront execution
11705 atomicrmw acquire - workgroup - local 1. ds_atomic
11706 2. s_waitcnt lgkmcnt(0)
11709 - Must happen before
11715 older than the local
11727 atomicrmw acquire - workgroup - generic 1. flat_atomic
11728 2. s_waitcnt lgkmcnt(0) &
11731 - If CU wavefront execution
11732 mode, omit vm/vscnt(0).
11733 - If OpenCL, omit lgkmcnt(0).
11734 - Use vmcnt(0) if atomic with
11735 return and vscnt(0) if
11736 atomic with no-return.
11737 - Must happen before
11749 - If CU wavefront execution
11756 atomicrmw acquire - agent - global 1. buffer/global_atomic
11757 - system 2. s_waitcnt vm/vscnt(0)
11759 - Use vmcnt(0) if atomic with
11760 return and vscnt(0) if
11761 atomic with no-return.
11762 - Must happen before
11774 - Must happen before
11784 atomicrmw acquire - agent - generic 1. flat_atomic
11785 - system 2. s_waitcnt vm/vscnt(0) &
11790 - Use vmcnt(0) if atomic with
11791 return and vscnt(0) if
11792 atomic with no-return.
11793 - Must happen before
11805 - Must happen before
11815 fence acquire - singlethread *none* *none*
11817 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
11818 vmcnt(0) & vscnt(0)
11820 - If CU wavefront execution
11821 mode, omit vmcnt(0) and
11830 vmcnt(0) and vscnt(0).
11831 - However, since LLVM
11836 always generate. If
11846 - Could be split into
11848 vmcnt(0), s_waitcnt
11849 vscnt(0) and s_waitcnt
11850 lgkmcnt(0) to allow
11852 independently moved
11855 - s_waitcnt vmcnt(0)
11858 global/generic load
11860 atomicrmw-with-return-value
11863 and memory ordering
11867 fence-paired-atomic).
11868 - s_waitcnt vscnt(0)
11872 atomicrmw-no-return-value
11875 and memory ordering
11879 fence-paired-atomic).
11880 - s_waitcnt lgkmcnt(0)
11887 and memory ordering
11891 fence-paired-atomic).
11892 - Must happen before
11896 fence-paired atomic
11898 before invalidating
11902 locations read must
11906 fence-paired-atomic.
11910 - If CU wavefront execution
11917 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
11918 - system vmcnt(0) & vscnt(0)
11927 vmcnt(0) and vscnt(0).
11928 - However, since LLVM
11936 - Could be split into
11938 vmcnt(0), s_waitcnt
11939 vscnt(0) and s_waitcnt
11940 lgkmcnt(0) to allow
11942 independently moved
11945 - s_waitcnt vmcnt(0)
11948 global/generic load
11950 atomicrmw-with-return-value
11953 and memory ordering
11957 fence-paired-atomic).
11958 - s_waitcnt vscnt(0)
11962 atomicrmw-no-return-value
11965 and memory ordering
11969 fence-paired-atomic).
11970 - s_waitcnt lgkmcnt(0)
11977 and memory ordering
11981 fence-paired-atomic).
11982 - Must happen before
11986 fence-paired atomic
11988 before invalidating
11992 locations read must
11996 fence-paired-atomic.
12001 - Must happen before any
12002 following global/generic
12012 ------------------------------------------------------------------------------------
12013 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
12014 - wavefront - local
12016 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12017 - generic vmcnt(0) & vscnt(0)
12019 - If CU wavefront execution
12020 mode, omit vmcnt(0) and
12024 - Could be split into
12026 vmcnt(0), s_waitcnt
12027 vscnt(0) and s_waitcnt
12028 lgkmcnt(0) to allow
12030 independently moved
12033 - s_waitcnt vmcnt(0)
12036 global/generic load/load
12038 atomicrmw-with-return-value.
12039 - s_waitcnt vscnt(0)
12045 atomicrmw-no-return-value.
12046 - s_waitcnt lgkmcnt(0)
12053 - Must happen before
12061 store that is being
12064 2. buffer/global/flat_store
12065 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12067 - If CU wavefront execution
12070 - Could be split into
12072 vmcnt(0) and s_waitcnt
12075 independently moved
12078 - s_waitcnt vmcnt(0)
12081 global/generic load/load
12083 atomicrmw-with-return-value.
12084 - s_waitcnt vscnt(0)
12088 store/store atomic/
12089 atomicrmw-no-return-value.
12090 - Must happen before
12098 store that is being
12102 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
12103 - system - generic vmcnt(0) & vscnt(0)
12109 - Could be split into
12111 vmcnt(0), s_waitcnt vscnt(0)
12113 lgkmcnt(0) to allow
12115 independently moved
12118 - s_waitcnt vmcnt(0)
12124 atomicrmw-with-return-value.
12125 - s_waitcnt vscnt(0)
12129 store/store atomic/
12130 atomicrmw-no-return-value.
12131 - s_waitcnt lgkmcnt(0)
12138 - Must happen before
12146 store that is being
12149 2. buffer/global/flat_store
12150 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
12151 - wavefront - local
12153 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12154 - generic vmcnt(0) & vscnt(0)
12156 - If CU wavefront execution
12157 mode, omit vmcnt(0) and
12159 - If OpenCL, omit lgkmcnt(0).
12160 - Could be split into
12162 vmcnt(0), s_waitcnt
12163 vscnt(0) and s_waitcnt
12164 lgkmcnt(0) to allow
12166 independently moved
12169 - s_waitcnt vmcnt(0)
12172 global/generic load/load
12174 atomicrmw-with-return-value.
12175 - s_waitcnt vscnt(0)
12181 atomicrmw-no-return-value.
12182 - s_waitcnt lgkmcnt(0)
12189 - Must happen before
12200 2. buffer/global/flat_atomic
12201 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12203 - If CU wavefront execution
12206 - Could be split into
12208 vmcnt(0) and s_waitcnt
12211 independently moved
12214 - s_waitcnt vmcnt(0)
12217 global/generic load/load
12219 atomicrmw-with-return-value.
12220 - s_waitcnt vscnt(0)
12224 store/store atomic/
12225 atomicrmw-no-return-value.
12226 - Must happen before
12234 store that is being
12238 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
12239 - system - generic vmcnt(0) & vscnt(0)
12243 - Could be split into
12245 vmcnt(0), s_waitcnt
12246 vscnt(0) and s_waitcnt
12247 lgkmcnt(0) to allow
12249 independently moved
12252 - s_waitcnt vmcnt(0)
12257 atomicrmw-with-return-value.
12258 - s_waitcnt vscnt(0)
12262 store/store atomic/
12263 atomicrmw-no-return-value.
12264 - s_waitcnt lgkmcnt(0)
12271 - Must happen before
12276 to global and local
12282 2. buffer/global/flat_atomic
12283 fence release - singlethread *none* *none*
12285 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12286 vmcnt(0) & vscnt(0)
12288 - If CU wavefront execution
12289 mode, omit vmcnt(0) and
12298 vmcnt(0) and vscnt(0).
12299 - However, since LLVM
12304 always generate. If
12314 - Could be split into
12316 vmcnt(0), s_waitcnt
12317 vscnt(0) and s_waitcnt
12318 lgkmcnt(0) to allow
12320 independently moved
12323 - s_waitcnt vmcnt(0)
12329 atomicrmw-with-return-value.
12330 - s_waitcnt vscnt(0)
12334 store/store atomic/
12335 atomicrmw-no-return-value.
12336 - s_waitcnt lgkmcnt(0)
12341 atomic/store atomic/
12343 - Must happen before
12344 any following store
12348 and memory ordering
12352 fence-paired-atomic).
12359 fence-paired-atomic.
12361 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
12362 - system vmcnt(0) & vscnt(0)
12371 vmcnt(0) and vscnt(0).
12372 - However, since LLVM
12377 always generate. If
12387 - Could be split into
12389 vmcnt(0), s_waitcnt
12390 vscnt(0) and s_waitcnt
12391 lgkmcnt(0) to allow
12393 independently moved
12396 - s_waitcnt vmcnt(0)
12401 atomicrmw-with-return-value.
12402 - s_waitcnt vscnt(0)
12406 store/store atomic/
12407 atomicrmw-no-return-value.
12408 - s_waitcnt lgkmcnt(0)
12415 - Must happen before
12416 any following store
12420 and memory ordering
12424 fence-paired-atomic).
12431 fence-paired-atomic.
12433 **Acquire-Release Atomic**
12434 ------------------------------------------------------------------------------------
12435 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
12436 - wavefront - local
12438 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12439 vmcnt(0) & vscnt(0)
12441 - If CU wavefront execution
12442 mode, omit vmcnt(0) and
12446 - Must happen after
12452 - Could be split into
12454 vmcnt(0), s_waitcnt
12455 vscnt(0), and s_waitcnt
12456 lgkmcnt(0) to allow
12458 independently moved
12461 - s_waitcnt vmcnt(0)
12464 global/generic load/load
12466 atomicrmw-with-return-value.
12467 - s_waitcnt vscnt(0)
12473 atomicrmw-no-return-value.
12474 - s_waitcnt lgkmcnt(0)
12481 - Must happen before
12492 2. buffer/global_atomic
12493 3. s_waitcnt vm/vscnt(0)
12495 - If CU wavefront execution
12497 - Use vmcnt(0) if atomic with
12498 return and vscnt(0) if
12499 atomic with no-return.
12500 - Must happen before
12512 - If CU wavefront execution
12519 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12521 - If CU wavefront execution
12524 - Could be split into
12526 vmcnt(0) and s_waitcnt
12529 independently moved
12532 - s_waitcnt vmcnt(0)
12535 global/generic load/load
12537 atomicrmw-with-return-value.
12538 - s_waitcnt vscnt(0)
12542 store/store atomic/
12543 atomicrmw-no-return-value.
12544 - Must happen before
12552 store that is being
12556 3. s_waitcnt lgkmcnt(0)
12559 - Must happen before
12565 older than the local load
12571 - If CU wavefront execution
12579 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
12580 vmcnt(0) & vscnt(0)
12582 - If CU wavefront execution
12583 mode, omit vmcnt(0) and
12585 - If OpenCL, omit lgkmcnt(0).
12586 - Could be split into
12588 vmcnt(0), s_waitcnt
12589 vscnt(0) and s_waitcnt
12590 lgkmcnt(0) to allow
12592 independently moved
12595 - s_waitcnt vmcnt(0)
12598 global/generic load/load
12600 atomicrmw-with-return-value.
12601 - s_waitcnt vscnt(0)
12607 atomicrmw-no-return-value.
12608 - s_waitcnt lgkmcnt(0)
12615 - Must happen before
12627 3. s_waitcnt lgkmcnt(0) &
12628 vmcnt(0) & vscnt(0)
12630 - If CU wavefront execution
12631 mode, omit vmcnt(0) and
12633 - If OpenCL, omit lgkmcnt(0).
12634 - Must happen before
12640 older than the load
12646 - If CU wavefront execution
12653 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
12654 - system vmcnt(0) & vscnt(0)
12658 - Could be split into
12660 vmcnt(0), s_waitcnt
12661 vscnt(0) and s_waitcnt
12662 lgkmcnt(0) to allow
12664 independently moved
12667 - s_waitcnt vmcnt(0)
12672 atomicrmw-with-return-value.
12673 - s_waitcnt vscnt(0)
12677 store/store atomic/
12678 atomicrmw-no-return-value.
12679 - s_waitcnt lgkmcnt(0)
12686 - Must happen before
12697 2. buffer/global_atomic
12698 3. s_waitcnt vm/vscnt(0)
12700 - Use vmcnt(0) if atomic with
12701 return and vscnt(0) if
12702 atomic with no-return.
12703 - Must happen before
12715 - Must happen before
12725 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
12726 - system vmcnt(0) & vscnt(0)
12730 - Could be split into
12732 vmcnt(0), s_waitcnt
12733 vscnt(0), and s_waitcnt
12734 lgkmcnt(0) to allow
12736 independently moved
12739 - s_waitcnt vmcnt(0)
12744 atomicrmw-with-return-value.
12745 - s_waitcnt vscnt(0)
12749 store/store atomic/
12750 atomicrmw-no-return-value.
12751 - s_waitcnt lgkmcnt(0)
12758 - Must happen before
12770 3. s_waitcnt vm/vscnt(0) &
12775 - Use vmcnt(0) if atomic with
12776 return and vscnt(0) if
12777 atomic with no-return.
12778 - Must happen before
12790 - Must happen before
12800 fence acq_rel - singlethread *none* *none*
12802 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12803 vmcnt(0) & vscnt(0)
12805 - If CU wavefront execution
12806 mode, omit vmcnt(0) and
12815 vmcnt(0) and vscnt(0).
12825 - Could be split into
12827 vmcnt(0), s_waitcnt
12828 vscnt(0) and s_waitcnt
12829 lgkmcnt(0) to allow
12831 independently moved
12834 - s_waitcnt vmcnt(0)
12840 atomicrmw-with-return-value.
12841 - s_waitcnt vscnt(0)
12845 store/store atomic/
12846 atomicrmw-no-return-value.
12847 - s_waitcnt lgkmcnt(0)
12852 atomic/store atomic/
12854 - Must happen before
12873 and memory ordering
12877 acquire-fence-paired-atomic)
12890 local/generic store
12894 and memory ordering
12898 release-fence-paired-atomic).
12902 - Must happen before
12906 acquire-fence-paired
12907 atomic has completed
12908 before invalidating
12912 locations read must
12916 acquire-fence-paired-atomic.
12920 - If CU wavefront execution
12927 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
12928 - system vmcnt(0) & vscnt(0)
12937 vmcnt(0) and vscnt(0).
12938 - However, since LLVM
12946 - Could be split into
12948 vmcnt(0), s_waitcnt
12949 vscnt(0) and s_waitcnt
12950 lgkmcnt(0) to allow
12952 independently moved
12955 - s_waitcnt vmcnt(0)
12961 atomicrmw-with-return-value.
12962 - s_waitcnt vscnt(0)
12966 store/store atomic/
12967 atomicrmw-no-return-value.
12968 - s_waitcnt lgkmcnt(0)
12975 - Must happen before
12980 global/local/generic
12985 and memory ordering
12989 acquire-fence-paired-atomic)
12991 before invalidating
13001 global/local/generic
13006 and memory ordering
13010 release-fence-paired-atomic).
13018 - Must happen before
13032 **Sequential Consistent Atomic**
13033 ------------------------------------------------------------------------------------
13034 load atomic seq_cst - singlethread - global *Same as corresponding
13035 - wavefront - local load atomic acquire,
13036 - generic except must generate
13037 all instructions even
13039 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
13040 - generic vmcnt(0) & vscnt(0)
13042 - If CU wavefront execution
13043 mode, omit vmcnt(0) and
13045 - Could be split into
13047 vmcnt(0), s_waitcnt
13048 vscnt(0), and s_waitcnt
13049 lgkmcnt(0) to allow
13051 independently moved
13054 - s_waitcnt lgkmcnt(0) must
13061 ordering of seq_cst
13067 lgkmcnt(0) and so do
13070 - s_waitcnt vmcnt(0)
13073 global/generic load
13075 atomicrmw-with-return-value
13077 ordering of seq_cst
13086 - s_waitcnt vscnt(0)
13089 global/generic store
13091 atomicrmw-no-return-value
13093 ordering of seq_cst
13105 consistent global/local
13106 memory instructions
13112 prevents reordering
13115 seq_cst load. (Note
13121 followed by a store
13128 release followed by
13131 order. The s_waitcnt
13132 could be placed after
13133 seq_store or before
13136 make the s_waitcnt be
13137 as late as possible
13143 instructions same as
13146 except must generate
13147 all instructions even
13149 load atomic seq_cst - workgroup - local
13151 1. s_waitcnt vmcnt(0) & vscnt(0)
13153 - If CU wavefront execution
13155 - Could be split into
13157 vmcnt(0) and s_waitcnt
13160 independently moved
13163 - s_waitcnt vmcnt(0)
13166 global/generic load
13168 atomicrmw-with-return-value
13170 ordering of seq_cst
13179 - s_waitcnt vscnt(0)
13182 global/generic store
13184 atomicrmw-no-return-value
13186 ordering of seq_cst
13199 memory instructions
13205 prevents reordering
13208 seq_cst load. (Note
13214 followed by a store
13221 release followed by
13224 order. The s_waitcnt
13225 could be placed after
13226 seq_store or before
13229 make the s_waitcnt be
13230 as late as possible
13236 instructions same as
13239 except must generate
13240 all instructions even
13243 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
13244 - system - generic vmcnt(0) & vscnt(0)
13246 - Could be split into
13248 vmcnt(0), s_waitcnt
13249 vscnt(0) and s_waitcnt
13250 lgkmcnt(0) to allow
13252 independently moved
13255 - s_waitcnt lgkmcnt(0)
13262 ordering of seq_cst
13268 lgkmcnt(0) and so do
13271 - s_waitcnt vmcnt(0)
13274 global/generic load
13276 atomicrmw-with-return-value
13278 ordering of seq_cst
13287 - s_waitcnt vscnt(0)
13290 global/generic store
13292 atomicrmw-no-return-value
13294 ordering of seq_cst
13307 memory instructions
13313 prevents reordering
13316 seq_cst load. (Note
13322 followed by a store
13329 release followed by
13332 order. The s_waitcnt
13333 could be placed after
13334 seq_store or before
13337 make the s_waitcnt be
13338 as late as possible
13344 instructions same as
13347 except must generate
13348 all instructions even
13350 store atomic seq_cst - singlethread - global *Same as corresponding
13351 - wavefront - local store atomic release,
13352 - workgroup - generic except must generate
13353 - agent all instructions even
13354 - system for OpenCL.*
13355 atomicrmw seq_cst - singlethread - global *Same as corresponding
13356 - wavefront - local atomicrmw acq_rel,
13357 - workgroup - generic except must generate
13358 - agent all instructions even
13359 - system for OpenCL.*
13360 fence seq_cst - singlethread *none* *Same as corresponding
13361 - wavefront fence acq_rel,
13362 - workgroup except must generate
13363 - agent all instructions even
13364 - system for OpenCL.*
13365 ============ ============ ============== ========== ================================
13367 .. _amdgpu-amdhsa-trap-handler-abi:
13372 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
13373 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
13374 supports the ``s_trap`` instruction. For usage see:
13376 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
13377 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
13378 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`
13380 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
13381 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
13383 =================== =============== =============== =======================================
13384 Usage Code Sequence Trap Handler Description
13386 =================== =============== =============== =======================================
13387 reserved ``s_trap 0x00`` Reserved by hardware.
13388 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap``
13389 ``queue_ptr`` intrinsic (not implemented).
13392 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13393 ``queue_ptr`` the trap instruction. The associated
13394 queue is signalled to put it into the
13395 error state. When the queue is put in
13396 the error state, the waves executing
13397 dispatches on the queue will be
13399 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13400 as a no-operation. The trap handler
13401 is entered and immediately returns to
13402 continue execution of the wavefront.
13403 - If the debugger is enabled, causes
13404 the debug trap to be reported by the
13405 debugger and the wavefront is put in
13406 the halt state with the PC at the
13407 instruction. The debugger must
13408 increment the PC and resume the wave.
13409 reserved ``s_trap 0x04`` Reserved.
13410 reserved ``s_trap 0x05`` Reserved.
13411 reserved ``s_trap 0x06`` Reserved.
13412 reserved ``s_trap 0x07`` Reserved.
13413 reserved ``s_trap 0x08`` Reserved.
13414 reserved ``s_trap 0xfe`` Reserved.
13415 reserved ``s_trap 0xff`` Reserved.
13416 =================== =============== =============== =======================================
13420 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
13421 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
13423 =================== =============== =============== =======================================
13424 Usage Code Sequence Trap Handler Description
13426 =================== =============== =============== =======================================
13427 reserved ``s_trap 0x00`` Reserved by hardware.
13428 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for
13429 breakpoints. Causes wave to be halted
13430 with the PC at the trap instruction.
13431 The debugger is responsible for
13432 resuming the wave, including the
13433 instruction the breakpoint overwrote.
13434 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
13435 ``queue_ptr`` the trap instruction. The associated
13436 queue is signalled to put it into the
13437 error state. When the queue is put in
13438 the error state, the waves executing
13439 dispatches on the queue will be
13441 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
13442 as a no-operation. The trap handler
13443 is entered and immediately returns to
13444 continue execution of the wavefront.
13445 - If the debugger is enabled, causes
13446 the debug trap to be reported by the
13447 debugger and the wavefront is put in
13448 the halt state with the PC at the
13449 instruction. The debugger must
13450 increment the PC and resume the wave.
13451 reserved ``s_trap 0x04`` Reserved.
13452 reserved ``s_trap 0x05`` Reserved.
13453 reserved ``s_trap 0x06`` Reserved.
13454 reserved ``s_trap 0x07`` Reserved.
13455 reserved ``s_trap 0x08`` Reserved.
13456 reserved ``s_trap 0xfe`` Reserved.
13457 reserved ``s_trap 0xff`` Reserved.
13458 =================== =============== =============== =======================================
13462 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
13463 :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table
13465 =================== =============== ================ ================= =======================================
13466 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
13467 =================== =============== ================ ================= =======================================
13468 reserved ``s_trap 0x00`` Reserved by hardware.
13469 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for
13470 breakpoints. Causes wave to be halted
13471 with the PC at the trap instruction.
13472 The debugger is responsible to resume
13473 the wave, including the instruction
13474 that the breakpoint overwrote.
13475 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at
13476 ``queue_ptr`` the trap instruction. The associated
13477 queue is signalled to put it into the
13478 error state. When the queue is put in
13479 the error state, the waves executing
13480 dispatches on the queue will be
13482 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If debugger not enabled then behaves
13483 as a no-operation. The trap handler
13484 is entered and immediately returns to
13485 continue execution of the wavefront.
13486 - If the debugger is enabled, causes
13487 the debug trap to be reported by the
13488 debugger and the wavefront is put in
13489 the halt state with the PC at the
13490 instruction. The debugger must
13491 increment the PC and resume the wave.
13492 reserved ``s_trap 0x04`` Reserved.
13493 reserved ``s_trap 0x05`` Reserved.
13494 reserved ``s_trap 0x06`` Reserved.
13495 reserved ``s_trap 0x07`` Reserved.
13496 reserved ``s_trap 0x08`` Reserved.
13497 reserved ``s_trap 0xfe`` Reserved.
13498 reserved ``s_trap 0xff`` Reserved.
13499 =================== =============== ================ ================= =======================================
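As a quick reference, the trap IDs listed in the tables above can be collected into a small lookup. This is only an illustrative sketch: the names ``TRAP_USAGE``, ``RESERVED_TRAP_IDS`` and ``describe_trap`` are hypothetical and not part of LLVM or any runtime, and IDs that do not appear in the tables are reported as "not listed" rather than guessed.

```python
# Trap IDs copied from the trap handler tables above.
# All names in this block are hypothetical helpers for illustration.
TRAP_USAGE = {
    0x00: "reserved by hardware",
    0x01: "debugger breakpoint",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
}

# IDs the tables mark as "Reserved."
RESERVED_TRAP_IDS = {0x04, 0x05, 0x06, 0x07, 0x08, 0xFE, 0xFF}


def describe_trap(trap_id: int) -> str:
    """Return the documented usage of an ``s_trap`` ID, if the tables list it."""
    if trap_id in TRAP_USAGE:
        return TRAP_USAGE[trap_id]
    if trap_id in RESERVED_TRAP_IDS:
        return "reserved"
    return "not listed"
```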
13501 .. _amdgpu-amdhsa-function-call-convention:
This section is currently incomplete and has inaccuracies. It is a work in
progress and will be updated as information is determined.
13511 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13512 addresses. Unswizzled addresses are normal linear addresses.
13514 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
13519 This section describes the call convention ABI for the outer kernel function.
13521 See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
13524 The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU backend implements function calls:
13527 1. Clang decides the kernarg layout to match the *HSA Programmer's Language
13530 - All structs are passed directly.
13531 - Lambda values are passed *TBA*.
13535 - Does this really follow HSA rules? Or are structs >16 bytes passed
13537 - What is ABI for lambda values?
13539 4. The kernel performs certain setup in its prolog, as described in
13540 :ref:`amdgpu-amdhsa-kernel-prolog`.
13542 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13544 Non-Kernel Functions
13545 ++++++++++++++++++++
13547 This section describes the call convention ABI for functions other than the
13548 outer kernel function.
13550 If a kernel has function calls then scratch is always allocated and used for
13551 the call stack which grows from low address to high address using the swizzled
13552 scratch address space.
13554 On entry to a function:
13556 1. SGPR0-3 contain a V# with the following properties (see
13557 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
13559 * Base address pointing to the beginning of the wavefront scratch backing
13561 * Swizzled with dword element size and stride of wavefront size elements.
2. The FLAT_SCRATCH register pair is set up. See
13564 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
13565 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
13566 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
13567 4. The EXEC register is set to the lanes active on entry to the function.
13568 5. MODE register: *TBD*
13569 6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
13571 7. SGPR30-31 return address (RA). The code address that the function must
13572 return to when it completes. The value is undefined if the function is *no
13574 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13575 offset relative to the beginning of the wavefront scratch backing memory.
13577 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
13578 offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
13581 The unswizzled SP value can be converted into the swizzled SP value by:
13583 | swizzled SP = unswizzled SP / wavefront size
13585 This may be used to obtain the private address space address of stack
13586 objects and to convert this address to a flat address by adding the flat
13587 scratch aperture base address.
The swizzled SP value is always 4 byte aligned for the ``r600``
architecture and 16 byte aligned for the ``amdgcn`` architecture.
13594 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13595 OpenCL language which has the largest base type defined as 16 bytes.
13597 On entry, the swizzled SP value is the address of the first function
13598 argument passed on the stack. Other stack passed arguments are positive
13599 offsets from the entry swizzled SP value.
13601 The function may use positive offsets beyond the last stack passed argument
13602 for stack allocated local variables and register spill slots. If necessary,
13603 the function may align these to greater alignment than 16 bytes. After these
13604 the function may dynamically allocate space for such things as runtime sized
13605 ``alloca`` local allocations.
13607 If the function calls another function, it will place any stack allocated
13608 arguments after the last local allocation and adjust SGPR32 to the address
13609 after the last local allocation.
13611 9. All other registers are unspecified.
13612 10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
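The stack pointer arithmetic described in item 8 above can be sketched as follows. This is a minimal illustration that assumes the wavefront size and the flat scratch aperture base are already known. The function names are hypothetical, and the example values in the usage note are arbitrary rather than taken from any real dispatch.

```python
# Alignment of the swizzled SP, as stated in the text above.
R600_STACK_ALIGNMENT = 4     # bytes
AMDGCN_STACK_ALIGNMENT = 16  # bytes


def swizzled_sp(unswizzled_sp: int, wavefront_size: int) -> int:
    """swizzled SP = unswizzled SP / wavefront size."""
    return unswizzled_sp // wavefront_size


def flat_stack_address(unswizzled_sp: int, wavefront_size: int,
                       flat_scratch_aperture_base: int) -> int:
    """Convert the SP to a flat address by adding the aperture base
    to the swizzled (private address space) value."""
    return flat_scratch_aperture_base + swizzled_sp(unswizzled_sp, wavefront_size)


def is_sp_aligned(unswizzled_sp: int, wavefront_size: int,
                  alignment: int = AMDGCN_STACK_ALIGNMENT) -> bool:
    """Check the swizzled SP against the architecture's stack alignment."""
    return swizzled_sp(unswizzled_sp, wavefront_size) % alignment == 0
```

For example, with a wavefront size of 64, an unswizzled SP of 4096 corresponds to a swizzled SP of 64, which satisfies the 16 byte ``amdgcn`` alignment.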
13615 On exit from a function:
13617 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13618 described below. Any registers used are considered clobbered registers.
13619 2. The following registers are preserved and have the same value as on entry:
13624 * All SGPR registers except the clobbered registers of SGPR4-31.
13642 Except the argument registers, the VGPRs clobbered and the preserved
13643 registers are intermixed at regular intervals in order to keep a
13644 similar ratio independent of the number of allocated VGPRs.
13646 * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13647 * Lanes of all VGPRs that are inactive at the call site.
13649 For the AMDGPU backend, an inter-procedural register allocation (IPRA)
optimization may mark some of the clobbered SGPR and VGPR registers as
13651 preserved if it can be determined that the called function does not change
3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
the function is available to the caller.
13662 - How are function results returned? The address of structured types is passed
13663 by reference, but what about other types?
13665 The function input arguments are made up of the formal arguments explicitly
13666 declared by the source language function plus the implicit input arguments used
13667 by the implementation.
13669 The source language input arguments are:
13671 1. Any source language implicit ``this`` or ``self`` argument comes first as a
13673 2. Followed by the function formal arguments in left to right source order.
13675 The source language result arguments are:
13677 1. The function result argument.
Source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
13681 each field is passed as if a separate argument. For input arguments, if the
13682 called function requires the struct to be in memory, for example because its
13683 address is taken, then the function body is responsible for allocating a stack
13684 location and copying the field arguments into it. Clang terms this *direct
Source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
location, copying the struct value into it, and passing its address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference when
13698 storing the result value. Clang terms this *structured return (sret)*.
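The three struct passing rules above can be summarized in a short sketch. It only restates the 16 byte threshold given in the text, which this section itself flags as possibly inaccurate. The function name and the returned labels are hypothetical and do not correspond to Clang or LLVM identifiers.

```python
# Threshold stated in the text above; structs at or below it are decomposed.
DECOMPOSE_LIMIT_BYTES = 16


def classify_struct_argument(size_in_bytes: int, is_result: bool = False) -> str:
    """Classify how a struct argument is passed, per the description above.

    Returns one of three illustrative labels:
      "decomposed"   - fields are passed as separate arguments
      "by_reference" - caller copies to the stack and passes the address
      "sret"         - caller allocates result storage and passes its address
    """
    if size_in_bytes <= DECOMPOSE_LIMIT_BYTES:
        return "decomposed"
    return "sret" if is_result else "by_reference"
```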
13700 *TODO: correct the ``sret`` definition.*
13704 Is this definition correct? Or is ``sret`` only used if passing in registers, and
13705 pass as non-decomposed struct as stack argument? Or something else? Is the
13706 memory location in the caller stack frame, or a stack memory argument and so
13707 no address is passed as the caller can directly write to the argument stack
13708 location? But then the stack location is still live after return. If an
13709 argument stack location is it the first stack argument or the last one?
13711 Lambda argument types are treated as struct types with an implementation defined
13716 Need to specify the ABI for lambda types for AMDGPU.
For the AMDGPU backend, all source language arguments (including the decomposed
13719 struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
13720 they are passed in SGPRs.
13722 The AMDGPU backend walks the function call graph from the leaves to determine
13723 which implicit input arguments are used, propagating to each caller of the
13724 function. The used implicit arguments are appended to the function arguments
13725 after the source language arguments in the following order:
Are recursion and external functions supported?
13731 1. Work-Item ID (1 VGPR)
The X, Y and Z work-item IDs are packed into a single VGPR with the following
13734 layout. Only fields actually used by the function are set. The other bits
13737 The values come from the initial kernel execution state. See
13738 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
.. table:: Work-item implicit argument layout
   :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table

   ======= ======= ==============
   Bits    Size    Field Name
   ======= ======= ==============
   9:0     10 bits X Work-Item ID
   19:10   10 bits Y Work-Item ID
   29:20   10 bits Z Work-Item ID
   31:30   2 bits  Unused
   ======= ======= ==============
13752 2. Dispatch Ptr (2 SGPRs)
13754 The value comes from the initial kernel execution state. See
13755 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13757 3. Queue Ptr (2 SGPRs)
13759 The value comes from the initial kernel execution state. See
13760 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13762 4. Kernarg Segment Ptr (2 SGPRs)
13764 The value comes from the initial kernel execution state. See
13765 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13767 5. Dispatch id (2 SGPRs)
13769 The value comes from the initial kernel execution state. See
13770 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13772 6. Work-Group ID X (1 SGPR)
13774 The value comes from the initial kernel execution state. See
13775 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13777 7. Work-Group ID Y (1 SGPR)
13779 The value comes from the initial kernel execution state. See
13780 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13782 8. Work-Group ID Z (1 SGPR)
13784 The value comes from the initial kernel execution state. See
13785 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13787 9. Implicit Argument Ptr (2 SGPRs)
13789 The value is computed by adding an offset to Kernarg Segment Ptr to get the
13790 global address space pointer to the first kernarg implicit argument.
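The packed Work-Item ID layout from item 1 of the list above can be sketched as plain bit manipulation. The helper names are hypothetical. The sketch sets all three fields, whereas the text notes that only the fields actually used by the function are set.

```python
FIELD_BITS = 10                      # each ID occupies 10 bits
FIELD_MASK = (1 << FIELD_BITS) - 1   # 0x3FF


def pack_work_item_id(x: int, y: int, z: int) -> int:
    """Pack X into bits 9:0, Y into bits 19:10 and Z into bits 29:20."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << FIELD_BITS) | (z << (2 * FIELD_BITS))


def unpack_work_item_id(vgpr: int) -> tuple:
    """Recover (X, Y, Z) from the packed 32-bit value; bits 31:30 are unused."""
    return (
        vgpr & FIELD_MASK,
        (vgpr >> FIELD_BITS) & FIELD_MASK,
        (vgpr >> (2 * FIELD_BITS)) & FIELD_MASK,
    )
```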
13792 The input and result arguments are assigned in order in the following manner:
13796 There are likely some errors and omissions in the following description that
13801 Check the Clang source code to decipher how function arguments and return
13802 results are handled. Also see the AMDGPU specific values used.
13804 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
13807 If there are more arguments than will fit in these registers, the remaining
13808 arguments are allocated on the stack in order on naturally aligned
13813 How are overly aligned structures allocated on the stack?
13815 * SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
13818 If there are more arguments than will fit in these registers, the remaining
13819 arguments are allocated on the stack in order on naturally aligned
13822 Note that decomposed struct type arguments may have some fields passed in
13823 registers and some in memory.
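The VGPR case of the assignment rule above can be sketched as follows. This is a simplified model and not the backend's implementation. It assumes 32 argument VGPRs (VGPR0-31, as stated in the function entry description), that each argument is a scalar whose natural alignment equals its size in bytes, and that once one argument spills to the stack all later ones do too. The function name and return format are hypothetical.

```python
NUM_ARG_VGPRS = 32  # VGPR0-31, per the function entry description above


def assign_vgpr_args(arg_sizes):
    """Assign each argument (given by its size in bytes) to a VGPR or a stack offset.

    Returns a list of ("vgpr", first_register) or ("stack", byte_offset) tuples.
    """
    locations = []
    next_vgpr = 0
    stack_offset = 0
    on_stack = False
    for size in arg_sizes:
        dwords = (size + 3) // 4  # number of 32-bit registers needed
        if not on_stack and next_vgpr + dwords <= NUM_ARG_VGPRS:
            locations.append(("vgpr", next_vgpr))
            next_vgpr += dwords
        else:
            on_stack = True
            # Round the offset up to the argument's natural alignment.
            stack_offset = -(-stack_offset // size) * size
            locations.append(("stack", stack_offset))
            stack_offset += size
    return locations
```

With 32 dword arguments followed by a 4 byte and an 8 byte argument, the first 32 land in VGPR0-31, the next is at stack offset 0, and the last is aligned up to stack offset 8.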
13827 So, a struct which can pass some fields as decomposed register arguments, will
13828 pass the rest as decomposed stack elements? But an argument that will not start
13829 in registers will not be decomposed and will be passed as a non-decomposed
13832 The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU backend implements function calls:
13835 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
13836 unswizzled scratch address. It is only needed if runtime sized ``alloca``
13837 are used, or for the reasons defined in ``SIFrameLowering``.
13838 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
13839 to access the incoming stack arguments in the function. The BP is needed
13840 only when the function requires the runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
13844 4. No CFI is currently generated. See
13845 :ref:`amdgpu-dwarf-call-frame-information`.
13849 CFI will be generated that defines the CFA as the unswizzled address
13850 relative to the wave scratch base in the unswizzled private address space
13851 of the lowest address stack allocated local variable.
13853 ``DW_AT_frame_base`` will be defined as the swizzled address in the
13854 swizzled private address space by dividing the CFA by the wavefront size
13855 (since CFA is always at least dword aligned which matches the scratch
13856 swizzle element size).
13858 If no dynamic stack alignment was performed, the stack allocated arguments
13859 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
13860 local variables and register spill slots are accessed as positive offsets
13861 relative to ``DW_AT_frame_base``.
13863 5. Function argument passing is implemented by copying the input physical
13864 registers to virtual registers on entry. The register allocator can spill if
13865 necessary. These are copied back to physical registers at call sites. The
13866 net effect is that each function call can have these values in entirely
13867 distinct locations. The IPRA can help avoid shuffling argument registers.
13868 6. Call sites are implemented by setting up the arguments at positive offsets
13869 from SP. Then SP is incremented to account for the known frame size before
13870 the call and decremented after the call.
13874 The CFI will reflect the changed calculation needed to compute the CFA
13877 7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
13878 emergency spill slot. Buffer instructions are used for stack accesses and
13879 not the ``flat_scratch`` instruction.
13883 Explain when the emergency spill slot is used.
13887 Possible broken issues:
13889 - Stack arguments must be aligned to required alignment.
13890 - Stack is aligned to max(16, max formal argument alignment)
13891 - Direct argument < 64 bits should check register budget.
13892 - Register budget calculation should respect ``inreg`` for SGPR.
13893 - SGPR overflow is not handled.
13894 - struct with 1 member unpeeling is not checking size of member.
13895 - ``sret`` is after ``this`` pointer.
13896 - Caller is not implementing stack realignment: need an extra pointer.
13897 - Should say AMDGPU passes FP rather than SP.
- Should CFI define CFA as address of locals or arguments? The difference
becomes apparent once dynamic alignment is implemented.
- If ``SCRATCH`` instructions could allow negative offsets, then FP could be
the highest address of the stack frame, with negative offsets used for locals.
This would allow SP to be the same as FP and could support signal-handler-like
behavior, since there would then be a real SP for the top of the stack.
13904 - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
13910 This section provides code conventions used when the target triple OS is
13911 ``amdpal`` (see :ref:`amdgpu-target-triples`).
13913 .. _amdgpu-amdpal-code-object-metadata-section:
13915 Code Object Metadata
13916 ~~~~~~~~~~~~~~~~~~~~
13920 The metadata is currently in development and is subject to major
13921 changes. Only the current version is supported. *When this document
13922 was generated the version was 2.6.*
13924 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
13925 record (see :ref:`amdgpu-note-records-v3-onwards`).
13927 The metadata is represented as Message Pack formatted binary data (see
13928 [MsgPack]_). The top level is a Message Pack map that includes the keys
13929 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
13930 and referenced tables.
13932 Additional information can be added to the maps. To avoid conflicts, any
13933 key names should be prefixed by "*vendor-name*." where ``vendor-name``
13934 can be the name of the vendor and specific vendor tool that generates the
13935 information. The prefix is abbreviated to simply "." when it appears
13936 within a map that has been added by the same *vendor-name*.
13938 .. table:: AMDPAL Code Object Metadata Map
13939 :name: amdgpu-amdpal-code-object-metadata-map-table
13941 =================== ============== ========= ======================================================================
13942 String Key Value Type Required? Description
13943 =================== ============== ========= ======================================================================
13944 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values
13945 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
13946 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See
13947 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
13948 definition of the keys included in that map.
13949 =================== ============== ========= ======================================================================
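The shape of the top-level metadata map can be illustrated with an in-memory structure, before any Message Pack encoding is applied. This sketch is an assumption-laden illustration: the helper ``missing_required_keys`` and the example values are hypothetical, and only the keys marked Required in the rows visible in this document (the top-level map and the pipeline metadata map) are checked.

```python
# Keys marked "Required" in the metadata tables of this document.
REQUIRED_TOP_LEVEL_KEYS = ("amdpal.version", "amdpal.pipelines")
REQUIRED_PIPELINE_KEYS = (".internal_pipeline_hash", ".registers")


def missing_required_keys(metadata: dict) -> list:
    """List required keys that are absent from a decoded metadata map."""
    missing = [key for key in REQUIRED_TOP_LEVEL_KEYS if key not in metadata]
    for index, pipeline in enumerate(metadata.get("amdpal.pipelines", [])):
        missing += [
            f"amdpal.pipelines[{index}]{key}"
            for key in REQUIRED_PIPELINE_KEYS
            if key not in pipeline
        ]
    return missing


# Placeholder values only; a real code object carries compiler-generated data.
example_metadata = {
    "amdpal.version": [2, 6],           # (major, minor)
    "amdpal.pipelines": [
        {
            ".name": "example_pipeline",
            ".internal_pipeline_hash": [0, 0],
            ".registers": {},
        }
    ],
}
```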
13953 .. table:: AMDPAL Code Object Pipeline Metadata Map
13954 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
13956 ====================================== ============== ========= ===================================================
13957 String Key Value Type Required? Description
13958 ====================================== ============== ========= ===================================================
13959 ".name" string Source name of the pipeline.
13960 ".type" string Pipeline type, e.g. VsPs. Values include:
13970 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
13971 2 integers 64 bits is the "stable" portion of the hash, used
13972 for e.g. shader replacement lookup. Upper 64 bits
13973 is the "unique" portion of the hash, used for
13974 e.g. pipeline cache lookup. The value is
13975 implementation defined, and can not be relied on
13976 between different builds of the compiler.
13977 ".shaders" map Per-API shader metadata. See
13978 :ref:`amdgpu-amdpal-code-object-shader-map-table`
13979 for the definition of the keys included in that
13981 ".hardware_stages" map Per-hardware stage metadata. See
13982 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
13983 for the definition of the keys included in that
13985 ".shader_functions" map Per-shader function metadata. See
13986 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
13987 for the definition of the keys included in that
13989 ".registers" map Required Hardware register configuration. See
13990 :ref:`amdgpu-amdpal-code-object-register-map-table`
13991 for the definition of the keys included in that
13993 ".user_data_limit" integer Number of user data entries accessed by this
13995 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
13996 NoUserDataSpilling.
13997 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
13998 viewport array index feature. Pipelines which use
13999 this feature can render into all 16 viewports,
14000 whereas pipelines which do not use it are
14001 restricted to viewport #0.
14002 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
14003 handling data-passing between the ES and GS
14004 shader stages. This can be zero if the data is
14005 passed using off-chip buffers. This value should
14006 be used to program all user-SGPRs which have been
14007 marked with "UserDataMapping::EsGsLdsSize"
14008 (typically only the GS and VS HW stages will ever
14009 have a user-SGPR so marked).
14010 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
14011 (maximum number of threads in a subgroup).
14012 ".num_interpolants" integer Graphics only. Number of PS interpolants.
14013 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
14014 ".api" string Name of the client graphics API.
14015 ".api_create_info" binary Graphics API shader create info binary blob. Can
14016 be defined by the driver using the compiler if
14017 they want to be able to correlate API-specific
14018 information used during creation at a later time.
14019 ====================================== ============== ========= ===================================================
14023 .. table:: AMDPAL Code Object Shader Map
14024 :name: amdgpu-amdpal-code-object-shader-map-table
14027 +-------------+--------------+-------------------------------------------------------------------+
14028 |String Key |Value Type |Description |
14029 +=============+==============+===================================================================+
14030 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
14031 |- ".vertex" | |for the definition of the keys included in that map. |
14034 |- ".geometry"| | |
14036 +-------------+--------------+-------------------------------------------------------------------+
14040 .. table:: AMDPAL Code Object API Shader Metadata Map
14041 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
14043 ==================== ============== ========= =====================================================================
14044 String Key Value Type Required? Description
14045 ==================== ============== ========= =====================================================================
14046 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
14047 2 integers is implementation defined, and can not be relied on between
14048 different builds of the compiler.
14049 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
14060 ==================== ============== ========= =====================================================================
14064 .. table:: AMDPAL Code Object Hardware Stage Map
14065 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14067 +-------------+--------------+-----------------------------------------------------------------------+
14068 |String Key |Value Type |Description |
14069 +=============+==============+=======================================================================+
14070 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14071 |- ".hs" | |for the definition of the keys included in that map. |
14077 +-------------+--------------+-----------------------------------------------------------------------+
14081 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14082 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14084 ========================== ============== ========= ===============================================================
14085 String Key Value Type Required? Description
14086 ========================== ============== ========= ===============================================================
14087 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
14088 ".scratch_memory_size" integer Scratch memory size in bytes.
14089 ".lds_size" integer Local Data Share size in bytes.
14090 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
14091 ".vgpr_count" integer Number of VGPRs used.
14092 ".agpr_count" integer Number of AGPRs used.
14093 ".sgpr_count" integer Number of SGPRs used.
14094 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
14095 directive to instruct the compiler to limit the VGPR usage to
14096 be less than or equal to the specified value (only set if
14097 different from HW default).
14098 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW
14100 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
14102 ".wavefront_size" integer Wavefront size (only set if different from HW default).
14103 ".uses_uavs" boolean The shader reads or writes UAVs.
14104 ".uses_rovs" boolean The shader reads or writes ROVs.
14105 ".writes_uavs" boolean The shader writes to one or more UAVs.
14106 ".writes_depth" boolean The shader writes out a depth value.
14107 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
14109 ".uses_prim_id" boolean The shader uses PrimID.
14110 ========================== ============== ========= ===============================================================
.. table:: AMDPAL Code Object Shader Function Map
   :name: amdgpu-amdpal-code-object-shader-function-map-table

   =============== ============== ====================================================================
   String Key      Value Type     Description
   =============== ============== ====================================================================
   *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
                                  entry address. The value is the function's metadata. See
                                  :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
   =============== ============== ====================================================================
14127 .. table:: AMDPAL Code Object Shader Function Metadata Map
14128 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14130 ============================= ============== =================================================================
14131 String Key Value Type Description
14132 ============================= ============== =================================================================
14133 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
14134 2 integers is implementation defined, and can not be relied on between
14135 different builds of the compiler.
14136 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
14137 ".lds_size" integer Size in bytes of LDS memory.
14138 ".vgpr_count" integer Number of VGPRs used by the shader.
14139 ".sgpr_count" integer Number of SGPRs used by the shader.
14140 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
14141 ".shader_subtype" string Shader subtype/kind. Values include:
14145 ============================= ============== =================================================================
14149 .. table:: AMDPAL Code Object Register Map
14150 :name: amdgpu-amdpal-code-object-register-map-table
14152 ========================== ============== ====================================================================
14153 32-bit Integer Key Value Type Description
14154 ========================== ============== ====================================================================
14155 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14156 a GRBM register (i.e., driver accessible GPU register number, not
14157 shader GPR register number). The driver is required to program each
14158 specified register to the corresponding specified value when
14159 executing this pipeline. Typically, the ``reg offsets`` are the
14160 ``uint16_t`` offsets to each register as defined by the hardware
14161 chip headers. The register is set to the provided value. However, a
14162 ``reg offset`` that specifies a user data register (e.g.,
14163 COMPUTE_USER_DATA_0) needs special treatment. See
14164 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14166 ========================== ============== ====================================================================
14168 .. _amdgpu-amdpal-code-object-user-data-section:
14173 Each hardware stage has a set of 32-bit physical SPI *user data registers*
14174 (either 16 or 32 based on graphics IP and the stage) which can be
14175 written from a command buffer and then loaded into SGPRs when waves are
14176 launched via a subsequent dispatch or draw operation. This is the way
14177 most arguments are passed from the application/runtime to a hardware
14180 PAL abstracts this functionality by exposing a set of 128 *user data
14181 entries* per pipeline a client can use to pass arguments from a command
14182 buffer to one or more shaders in that pipeline. The ELF code object must
14183 specify a mapping from virtualized *user data entries* to physical *user
14184 data registers*, and PAL is responsible for implementing that mapping,
14185 including spilling overflow *user data entries* to memory if needed.
14187 Since the *user data registers* are GRBM-accessible SPI registers, this
14188 mapping is actually embedded in the ``.registers`` metadata entry. For
14189 most registers, the value in that map is a literal 32-bit value that
14190 should be written to the register by the driver. However, when the
14191 register is a *user data register* (any USER_DATA register e.g.,
14192 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14193 the driver to write either a *user data entry* value or one of several
14194 driver-internal values to the register. This encoding is described in
14195 the following table:
14199 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0
14200 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14201 always be programmed to the address of the GlobalTable, and *user data
14202 register* 1 must always be programmed to the address of the PerShaderTable.
14206 .. table:: AMDPAL User Data Mapping
14207 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14209 ========== ================= ===============================================================================
14210 Value Name Description
14211 ========== ================= ===============================================================================
14212 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14213 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
14214 always point to *user data register* 0).
14215 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
14216 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14217 for more detail (should always point to *user data register* 1).
14218 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
14219 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14221 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14222 reference the draw index in the vertex shader. Only supported by the first
14223 stage in a graphics pipeline.
14224 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
14225 a graphics pipeline.
14226 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a
14228 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14229 a buffer containing the grid dimensions for a Compute dispatch operation. The
14230 high half of the address is stored in the next sequential user-SGPR. Only
14231 supported by compute pipelines.
14232 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
14233 space used for the ES/GS pseudo-ring-buffer for passing data between shader
14235 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
14236 pipeline instancing.
14237 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
14238 can only appear for one shader stage per pipeline.
14239 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
14240 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
14241 only appear for one shader stage per pipeline.
14242 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
14243 only appear for one shader stage per pipeline (PS). These replace color targets
14244 and are completely separate from any UAVs used by the shader. This is optional,
14245 and only used by the PS when UAV exports are used to replace color-target
14246 exports to optimize specific shaders.
14247 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
14248 some NGG pipelines to perform culling. This value contains the address of the
14249 first of two consecutive registers which provide the full GPU address.
14250 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
14251 ========== ================= ===============================================================================
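The encoding in the table above can be illustrated with a small decoder. This is only a sketch: ``describe_user_data_value`` is a hypothetical helper that is not part of PAL or LLVM, and it assumes the value has already been read from the ``.registers`` map entry of a *user data register*.

.. code-block:: python

  # Special values copied from the "AMDPAL User Data Mapping" table.
  SPECIAL_USER_DATA_VALUES = {
      0x10000000: "GlobalTable",
      0x10000001: "PerShaderTable",
      0x10000002: "SpillTable",
      0x10000003: "BaseVertex",
      0x10000004: "BaseInstance",
      0x10000005: "DrawIndex",
      0x10000006: "Workgroup",
      0x1000000A: "EsGsLdsSize",
      0x1000000B: "ViewId",
      0x1000000C: "StreamOutTable",
      0x1000000D: "PerShaderPerfData",
      0x1000000F: "VertexBufferTable",
      0x10000010: "UavExportTable",
      0x10000011: "NggCullingData",
      0x10000015: "FetchShaderPtr",
  }

  def describe_user_data_value(value):
      """Hypothetical helper: classify a value mapped to a user data register."""
      # Values 0..127 select one of the 128 user data entries.
      if 0 <= value <= 127:
          return f"user_data_entry[{value}]"
      # Everything else must be one of the driver-internal values.
      name = SPECIAL_USER_DATA_VALUES.get(value)
      if name is None:
          raise ValueError(f"unknown user data mapping value: {value:#x}")
      return name

Under these assumptions, ``describe_user_data_value(5)`` names *user data entry* 5, while ``describe_user_data_value(0x10000002)`` names the spill table pointer.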
14253 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14258 Low 32 bits of the GPU address for an optional buffer in the ``.data``
14259 section of the ELF. The high 32 bits of the address match the high 32 bits
14260 of the shader's program counter.
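As a rough illustration, the full 64-bit address could be reassembled as shown below. The function and parameter names are hypothetical; the only facts taken from the text are that the register holds the low 32 bits and that the high 32 bits match those of the shader's program counter.

.. code-block:: python

  def per_shader_table_address(table_low32, shader_pc):
      """Hypothetical helper: rebuild the 64-bit per-shader table address."""
      # Keep only the high 32 bits of the shader's program counter.
      high32 = shader_pc & 0xFFFFFFFF00000000
      # Combine them with the low 32 bits held in the user data register.
      return high32 | (table_low32 & 0xFFFFFFFF)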
14262 The buffer can be anything the shader compiler needs it for, and
14263 allows each shader to have its own region of the ``.data`` section.
14264 Typically, this could be a table of buffer SRDs and the data pointed to
14265 by the buffer SRDs, but it could be a flat-address region of memory as
14266 well. Its layout and usage are defined by the shader compiler.
14268 Each shader's table in the ``.data`` section is referenced by the symbol
14269 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
14270 hardware shader stage the data is for. E.g.,
14271 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
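The naming scheme can be sketched as a one-line formatter. ``per_shader_table_symbol`` is a hypothetical name, and the sketch does not validate the stage abbreviation against any list of hardware stages; only ``cs`` is confirmed by the text above.

.. code-block:: python

  def per_shader_table_symbol(stage):
      """Hypothetical helper: symbol naming a stage's table in ``.data``."""
      # Reject obviously malformed stage abbreviations such as "" or "c s".
      if not stage.isalnum():
          raise ValueError(f"invalid hardware stage abbreviation: {stage!r}")
      return f"_amdgpu_{stage}_shdr_intrl_data"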
14273 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14278 It is possible for a hardware shader to need access to more *user data
14279 entries* than there are slots available in user data registers for one
14280 or more hardware shader stages. In that case, the PAL runtime expects
14281 the necessary *user data entries* to be spilled to GPU memory and use
14282 one user data register to point to the spilled user data memory. The
14283 value of the *user data entry* must then represent the location where
14284 a shader expects to read the low 32 bits of the table's GPU virtual
14285 address. The *spill table* itself represents a set of 32-bit values
14286 managed by the PAL runtime in GPU-accessible memory that can be made
14287 indirectly accessible to a hardware shader.
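One possible spilling policy is sketched below, purely for illustration. The allocation order is an assumption and is not mandated by PAL: the sketch keeps registers 0 and 1 reserved, fills the remaining registers in order, and, when the entries do not fit, dedicates the last register to the spill table pointer. ``map_user_data`` and its parameters are hypothetical names.

.. code-block:: python

  GLOBAL_TABLE = 0x10000000      # value reserved for user data register 0
  PER_SHADER_TABLE = 0x10000001  # value reserved for user data register 1
  SPILL_TABLE = 0x10000002       # value marking the spill table pointer

  def map_user_data(entries, num_registers):
      """Return (register_map, spilled_entries) for one hardware stage."""
      register_map = {0: GLOBAL_TABLE, 1: PER_SHADER_TABLE}
      free = list(range(2, num_registers))
      spilled = []
      if len(entries) > len(free):
          # Not everything fits: give up one register for the spill pointer.
          register_map[free.pop()] = SPILL_TABLE
          spilled = list(entries[len(free):])
          entries = entries[:len(free)]
      for register, entry in zip(free, entries):
          register_map[register] = entry
      return register_map, spilled

With 16 registers and three entries nothing is spilled. With only 4 registers, one entry stays in a register and the other two go to the spill table.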
14292 This section provides code conventions used when the target triple OS is
14293 empty (see :ref:`amdgpu-target-triples`).
14298 For code objects generated by the AMDGPU backend for non-amdhsa OS, the runtime does
14299 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
14300 instructions are handled as follows:
14302 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14303 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14305 =============== =============== ===========================================
14306 Usage           Code Sequence   Description
14307 =============== =============== ===========================================
14308 llvm.trap       s_endpgm        Causes wavefront to be terminated.
14309 llvm.debugtrap  *none*          Compiler warning given that there is no
14310                                 trap handler installed.
14311 =============== =============== ===========================================
14321 When the language is OpenCL the following differences occur:
14323 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14324 2. The AMDGPU backend appends additional arguments to the kernel's explicit
14325 arguments for the AMDHSA OS (see
14326 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
14327 3. Additional metadata is generated
14328 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14330 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14331 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14333 ======== ==== ========= ===========================================
14334 Position Byte Byte Description
14336 ======== ==== ========= ===========================================
14337 1 8 8 OpenCL Global Offset X
14338 2 8 8 OpenCL Global Offset Y
14339 3 8 8 OpenCL Global Offset Z
14340 4 8 8 OpenCL address of printf buffer
14341 5 8 8 OpenCL address of virtual queue used by
14343 6 8 8 OpenCL address of AqlWrap struct used by
14345 7 8 8 Pointer argument used for Multi-grid
14347 ======== ==== ========= ===========================================
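The layout implied by the table can be sketched as follows. This assumes that the two numeric columns are the byte size and byte alignment of each argument, and that the implicit arguments start at the first suitably aligned offset after the explicit arguments. Both function names are hypothetical.

.. code-block:: python

  def align_up(value, alignment):
      """Round value up to the next multiple of alignment."""
      return (value + alignment - 1) // alignment * alignment

  def implicit_argument_offsets(explicit_args_size, count=7, size=8, alignment=8):
      """Map each table position (1-based) to a byte offset after the explicit args."""
      offset = explicit_args_size
      offsets = {}
      for position in range(1, count + 1):
          # Each implicit argument is 8 bytes with 8-byte alignment.
          offset = align_up(offset, alignment)
          offsets[position] = offset
          offset += size
      return offsets

For a kernel with 12 bytes of explicit arguments, this places the argument at position 1 at byte offset 16 and the argument at position 7 at byte offset 64.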
14354 When the language is HCC the following differences occur:
14356 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14358 .. _amdgpu-assembler:
14363 The AMDGPU backend has an LLVM-MC based assembler which is currently in development.
14364 It supports AMDGCN GFX6-GFX11.
14366 This section describes general syntax for instructions and operands.
14371 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14373 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14374 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14376 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14377 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14379 The order of operands and modifiers is fixed.
14380 Most modifiers are optional and may be omitted.
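The separator rules can be illustrated with a small formatter. ``format_instruction`` is a hypothetical helper, not an LLVM API; it only joins already valid operand and modifier strings and performs no validation.

.. code-block:: python

  def format_instruction(opcode, operands=(), modifiers=()):
      """Hypothetical helper: render one instruction in assembler syntax."""
      text = opcode
      if operands:
          # Operands are comma-separated.
          text += " " + ", ".join(operands)
      if modifiers:
          # Modifiers follow the operands and are space-separated.
          text += " " + " ".join(modifiers)
      return text

For instance, ``format_instruction("ds_add_u32", ["v2", "v4"], ["offset:16"])`` produces ``ds_add_u32 v2, v4 offset:16``.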
14382 Links to detailed instruction syntax description may be found in the following
14383 table. Note that features under development are not included
14384 in this description.
14386 ============= ============================================= =======================================
14387 Architecture Core ISA ISA Variants and Extensions
14388 ============= ============================================= =======================================
14389 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \-
14390 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \-
14391 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14393 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14395 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14397 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14399 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14401 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14403 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14405 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14407 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14409 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14411 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14413 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14415 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14417 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14419 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14421 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14423 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14425 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14427 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14429 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14431 RDNA 3 :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>` :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
14433 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
14435 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
14437 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
14438 ============= ============================================= =======================================
14440 For more information about instructions, their semantics and supported
14441 combinations of operands, refer to one of the instruction set architecture manuals
14442 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14443 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14444 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_,
14445 [AMD-GCN-GFX10-RDNA2]_ and [AMD-GCN-GFX11-RDNA3]_.
14450 A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14455 A detailed description of modifiers may be found
14456 :doc:`here<AMDGPUModifierSyntax>`.
14458 Instruction Examples
14459 ~~~~~~~~~~~~~~~~~~~~
14464 .. code-block:: nasm
14466 ds_add_u32 v2, v4 offset:16
14467 ds_write_src2_b64 v2 offset0:4 offset1:8
14468 ds_cmpst_f32 v2, v4, v6
14469 ds_min_rtn_f64 v[8:9], v2, v[4:5]
14471 For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA
14477 .. code-block:: nasm
14479 flat_load_dword v1, v[3:4]
14480 flat_store_dwordx3 v[3:4], v[5:7]
14481 flat_atomic_swap v1, v[3:4], v5 glc
14482 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14483 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14485 For a full list of supported instructions, refer to "FLAT instructions" in the ISA
14491 .. code-block:: nasm
14493 buffer_load_dword v1, off, s[4:7], s1
14494 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14495 buffer_store_format_xy v[1:2], off, s[4:7], s1
14497 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14499 For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA
14505 .. code-block:: nasm
14507 s_load_dword s1, s[2:3], 0xfc
14508 s_load_dwordx8 s[8:15], s[2:3], s4
14509 s_load_dwordx16 s[88:103], s[2:3], s4
14513 For a full list of supported instructions, refer to "Scalar Memory Operations" in
14519 .. code-block:: nasm
14522 s_mov_b64 s[0:1], 0x80000000
14524 s_wqm_b64 s[2:3], s[4:5]
14525 s_bcnt0_i32_b64 s1, s[2:3]
14526 s_swappc_b64 s[2:3], s[4:5]
14527 s_cbranch_join s[4:5]
14529 For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA
14535 .. code-block:: nasm
14537 s_add_u32 s1, s2, s3
14538 s_and_b64 s[2:3], s[4:5], s[6:7]
14539 s_cselect_b32 s1, s2, s3
14540 s_andn2_b32 s2, s4, s6
14541 s_lshr_b64 s[2:3], s[4:5], s6
14542 s_ashr_i32 s2, s4, s6
14543 s_bfm_b64 s[2:3], s4, s6
14544 s_bfe_i64 s[2:3], s[4:5], s6
14545 s_cbranch_g_fork s[4:5], s[6:7]
14547 For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA
14553 .. code-block:: nasm
14555 s_cmp_eq_i32 s1, s2
14556 s_bitcmp1_b32 s1, s2
14557 s_bitcmp0_b64 s[2:3], s4
14560 For a full list of supported instructions, refer to "SOPC Instructions" in the ISA
14566 .. code-block:: nasm
14571 s_waitcnt 0 ; Wait for all counters to be 0
14572 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
14573 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
14577 s_sendmsg sendmsg(MSG_INTERRUPT)
14580 For a full list of supported instructions, refer to "SOPP Instructions" in the ISA
14583 Unless otherwise mentioned, little verification is performed on the operands
14584 of SOPP Instructions, so it is up to the programmer to be familiar with the
14585 range of acceptable values.
14590 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
14591 the assembler will automatically use the optimal encoding based on its operands. To
14592 force a specific encoding, one can add a suffix to the opcode of the instruction:
14594 * _e32 for 32-bit VOP1/VOP2/VOPC
14595 * _e64 for 64-bit VOP3
14597 * _sdwa for VOP_SDWA
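The suffix convention can be sketched with a helper that separates a forced-encoding suffix from the base opcode. ``split_encoding_suffix`` is a hypothetical name, and the sketch recognizes only the ``_e32``, ``_e64`` and ``_sdwa`` suffixes.

.. code-block:: python

  # Suffixes that force a specific encoding (only the three handled here).
  ENCODING_SUFFIXES = ("_e32", "_e64", "_sdwa")

  def split_encoding_suffix(mnemonic):
      """Hypothetical helper: return (base_opcode, suffix_or_None)."""
      for suffix in ENCODING_SUFFIXES:
          if mnemonic.endswith(suffix):
              return mnemonic[: -len(suffix)], suffix
      # No suffix: the assembler is free to pick the encoding.
      return mnemonic, None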
14599 VOP1/VOP2/VOP3/VOPC examples:
14601 .. code-block:: nasm
14604 v_mov_b32_e32 v1, v2
14606 v_cvt_f64_i32_e32 v[1:2], v2
14607 v_floor_f32_e32 v1, v2
14608 v_bfrev_b32_e32 v1, v2
14609 v_add_f32_e32 v1, v2, v3
14610 v_mul_i32_i24_e64 v1, v2, 3
14611 v_mul_i32_i24_e32 v1, -3, v3
14612 v_mul_i32_i24_e32 v1, -100, v3
14613 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
14614 v_max_f16_e32 v1, v2, v3
14618 .. code-block:: nasm
14620 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
14621 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14622 v_mov_b32 v0, v0 wave_shl:1
14623 v_mov_b32 v0, v0 row_mirror
14624 v_mov_b32 v0, v0 row_bcast:31
14625 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
14626 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14627 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14631 .. code-block:: nasm
14633 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14634 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14635 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14636 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14637 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14639 For full list of supported instructions, refer to "Vector ALU instructions".
14641 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14643 Code Object V2 Predefined Symbols
14644 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14647 Code object V2 is not the default code object version emitted by
14648 this version of LLVM.
14650 The AMDGPU assembler defines and updates some symbols automatically. These
14651 symbols do not affect code generation.
14653 .option.machine_version_major
14654 +++++++++++++++++++++++++++++
14656 Set to the GFX major generation number of the target being assembled for. For
14657 example, when assembling for a "GFX9" target this will be set to the integer
14658 value "9". The possible GFX major generation numbers are presented in
14659 :ref:`amdgpu-processors`.
14661 .option.machine_version_minor
14662 +++++++++++++++++++++++++++++
14664 Set to the GFX minor generation number of the target being assembled for. For
14665 example, when assembling for a "GFX810" target this will be set to the integer
14666 value "1". The possible GFX minor generation numbers are presented in
14667 :ref:`amdgpu-processors`.
14669 .option.machine_version_stepping
14670 ++++++++++++++++++++++++++++++++
14672 Set to the GFX stepping generation number of the target being assembled for.
14673 For example, when assembling for a "GFX704" target this will be set to the
14674 integer value "4". The possible GFX stepping generation numbers are presented
14675 in :ref:`amdgpu-processors`.
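The relationship between a processor name and the three version symbols can be sketched as follows. ``split_gfx_version`` is a hypothetical helper that expects a full processor name such as ``gfx810``, not a generation name such as ``GFX9``. It assumes that the last character is the stepping, written as a hexadecimal digit (so ``gfx90a`` has stepping 10), that the character before it is the minor version, and that the remaining digits are the major version.

.. code-block:: python

  def split_gfx_version(processor):
      """Hypothetical helper: split e.g. 'gfx810' into (major, minor, stepping)."""
      name = processor.lower()
      if not name.startswith("gfx") or len(name) < 6:
          raise ValueError(f"not a full GFX processor name: {processor!r}")
      digits = name[3:]
      major = int(digits[:-2])        # leading decimal digits
      minor = int(digits[-2])         # second-to-last digit
      stepping = int(digits[-1], 16)  # last digit, assumed hexadecimal
      return major, minor, stepping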
14680 Set to zero each time a
14681 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14682 encountered. At each instruction, if the current value of this symbol is less
14683 than or equal to the maximum VGPR number explicitly referenced within that
14684 instruction then the symbol value is updated to equal that VGPR number plus
14690 Set to zero each time a
14691 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14692 encountered. At each instruction, if the current value of this symbol is less
14693 than or equal to the maximum SGPR number explicitly referenced within that
14694 instruction then the symbol value is updated to equal that SGPR number plus
14697 .. _amdgpu-amdhsa-assembler-directives-v2:
14699 Code Object V2 Directives
14700 ~~~~~~~~~~~~~~~~~~~~~~~~~
14703 Code object V2 is not the default code object version emitted by
14704 this version of LLVM.
14706 The AMDGPU ABI defines auxiliary data in the output code object. In assembly source,
14707 this data can be specified with assembler directives.
14709 .hsa_code_object_version major, minor
14710 +++++++++++++++++++++++++++++++++++++
14712 *major* and *minor* are integers that specify the version of the HSA code
14713 object that will be generated by the assembler.
14715 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
14716 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14719 *major*, *minor*, and *stepping* are all integers that describe the instruction
14720 set architecture (ISA) version of the assembly program.
14722 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
14723 "AMD" and *arch* should always be equal to "AMDGPU".
14725 By default, the assembler will derive the ISA version, *vendor*, and *arch*
14726 from the value of the -mcpu option that is passed to the assembler.
14728 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
14730 .amdgpu_hsa_kernel (name)
14731 +++++++++++++++++++++++++
14733 This directive specifies that the symbol with the given name is a kernel entry
14734 point (label) and the object should contain the corresponding symbol of type
14735 STT_AMDGPU_HSA_KERNEL.
14740 This directive marks the beginning of a list of key / value pairs that are used
14741 to specify the amd_kernel_code_t object that will be emitted by the assembler.
14742 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
14743 amd_kernel_code_t values that are unspecified a default value will be used. The
14744 default value for all keys is 0, with the following exceptions:
14746 - *amd_code_version_major* defaults to 1.
14747 - *amd_code_version_minor* defaults to 2.
14748 - *amd_machine_kind* defaults to 1.
14749 - *amd_machine_version_major*, *amd_machine_version_minor*, and
14750 *amd_machine_version_stepping* are derived from the value of the -mcpu option
14751 that is passed to the assembler.
14752 - *kernel_code_entry_byte_offset* defaults to 256.
14753 - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
14754 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
14755 Note that wavefront size is specified as a power of two, so a value of **n**
14756 means a size of 2^ **n**.
14757 - *call_convention* defaults to -1.
14758 - *kernarg_segment_alignment*, *group_segment_alignment*, and
14759 *private_segment_alignment* default to 4. Note that alignments are specified
14760 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
14761 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
14763 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
14765 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
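The power-of-two convention used by *wavefront_size* and the alignment keys can be sketched as follows. Both helper names are hypothetical. The default selection mirrors the rule stated in the list above and takes the GFX major generation number and the ``wavefrontsize64`` target feature as inputs.

.. code-block:: python

  def decode_power_of_two(n):
      """A stored value of n means a size or alignment of 2**n."""
      return 1 << n

  def default_wavefront_size_log2(gfx_major, wavefrontsize64=False):
      """Hypothetical helper: default value of the wavefront_size key."""
      # Before GFX10 the default is always 6; from GFX10 onwards it is 6
      # only when the wavefrontsize64 target feature is enabled.
      if gfx_major < 10 or wavefrontsize64:
          return 6
      return 5

Under this reading, the default alignment value of 4 corresponds to ``decode_power_of_two(4)``, that is, 16 bytes.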
14767 The *.amd_kernel_code_t* directive must be placed immediately after the
14768 function label and before any instructions.
14770 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the
14771 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
14773 .. _amdgpu-amdhsa-assembler-example-v2:
14775 Code Object V2 Example Source Code
14776 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14779 Code Object V2 is not the default code object version emitted by
14780 this version of LLVM.
14782 Here is an example of a minimal assembly source file, defining one HSA kernel:
14787 .hsa_code_object_version 1,0
14788 .hsa_code_object_isa
14793 .amdgpu_hsa_kernel hello_world
14798 enable_sgpr_kernarg_segment_ptr = 1
14800 compute_pgm_rsrc1_vgprs = 0
14801 compute_pgm_rsrc1_sgprs = 0
14802 compute_pgm_rsrc2_user_sgpr = 2
14803 compute_pgm_rsrc1_wgp_mode = 0
14804 compute_pgm_rsrc1_mem_ordered = 0
14805 compute_pgm_rsrc1_fwd_progress = 1
14806 .end_amd_kernel_code_t
14808 s_load_dwordx2 s[0:1], s[0:1], 0x0
14809 v_mov_b32 v0, 3.14159
14810 s_waitcnt lgkmcnt(0)
14813 flat_store_dword v[1:2], v0
14816 .size hello_world, .Lfunc_end0-hello_world
14818 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
14820 Code Object V3 and Above Predefined Symbols
14821 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14823 The AMDGPU assembler defines and updates some symbols automatically. These
14824 symbols do not affect code generation.
14826 .amdgcn.gfx_generation_number
14827 +++++++++++++++++++++++++++++
14829 Set to the GFX major generation number of the target being assembled for. For
14830 example, when assembling for a "GFX9" target this will be set to the integer
14831 value "9". The possible GFX major generation numbers are presented in
14832 :ref:`amdgpu-processors`.
14834 .amdgcn.gfx_generation_minor
14835 ++++++++++++++++++++++++++++
14837 Set to the GFX minor generation number of the target being assembled for. For
14838 example, when assembling for a "GFX810" target this will be set to the integer
14839 value "1". The possible GFX minor generation numbers are presented in
14840 :ref:`amdgpu-processors`.
14842 .amdgcn.gfx_generation_stepping
14843 +++++++++++++++++++++++++++++++
14845 Set to the GFX stepping generation number of the target being assembled for.
14846 For example, when assembling for a "GFX704" target this will be set to the
14847 integer value "4". The possible GFX stepping generation numbers are presented
14848 in :ref:`amdgpu-processors`.
14850 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
14852 .amdgcn.next_free_vgpr
14853 ++++++++++++++++++++++
14855 Set to zero before assembly begins. At each instruction, if the current value
14856 of this symbol is less than or equal to the maximum VGPR number explicitly
14857 referenced within that instruction then the symbol value is updated to equal
14858 that VGPR number plus one.
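The update rule can be sketched with a small simulation. This is not the assembler's implementation: the helpers are hypothetical, they scan instruction text with regular expressions instead of parsing operands, and they recognize only the ``vN`` and ``v[N:M]`` register forms.

.. code-block:: python

  import re

  _VGPR_SINGLE = re.compile(r"\bv(\d+)\b")          # e.g. v0, v200
  _VGPR_RANGE = re.compile(r"\bv\[(\d+):(\d+)\]")   # e.g. v[1:2]

  def max_vgpr_referenced(instruction):
      """Return the highest VGPR number named in the text, or None."""
      numbers = [int(m.group(1)) for m in _VGPR_SINGLE.finditer(instruction)]
      numbers += [int(m.group(2)) for m in _VGPR_RANGE.finditer(instruction)]
      return max(numbers) if numbers else None

  def track_next_free_vgpr(instructions, start=0):
      """Apply the update rule to each instruction in order."""
      next_free = start
      for instruction in instructions:
          highest = max_vgpr_referenced(instruction)
          # Update only when the symbol is <= the highest VGPR referenced.
          if highest is not None and next_free <= highest:
              next_free = highest + 1
      return next_free

For the two instructions ``v_mov_b32 v0, 3.14159`` and ``flat_store_dword v[1:2], v0``, the highest VGPR referenced is 2, so the sketch ends with a value of 3.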
14860 May be used to set the `.amdhsa_next_free_vgpr` directive in
14861 :ref:`amdhsa-kernel-directives-table`.
14863 May be set at any time, e.g. manually set to zero at the start of each kernel.
14865 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
14867 .amdgcn.next_free_sgpr
14868 ++++++++++++++++++++++
14870 Set to zero before assembly begins. At each instruction, if the current value
14871 of this symbol is less than or equal to the maximum SGPR number explicitly
14872 referenced within that instruction then the symbol value is updated to equal
14873 that SGPR number plus one.
14875 May be used to set the `.amdhsa_next_free_sgpr` directive in
14876 :ref:`amdhsa-kernel-directives-table`.
14878 May be set at any time, e.g. manually set to zero at the start of each kernel.
14880 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
14882 Code Object V3 and Above Directives
14883 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14885 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
14886 architecture processors, and are not OS-specific. Directives which begin with
14887 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
14888 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
14889 :ref:`amdgpu-processors`.
14891 .. _amdgpu-assembler-directive-amdgcn-target:
14893 .amdgcn_target <target-triple> "-" <target-id>
14894 ++++++++++++++++++++++++++++++++++++++++++++++
14896 Optional directive which declares the ``<target-triple>-<target-id>`` supported
14897 by the containing assembler source file. Used by the assembler to validate
14898 command-line options such as ``-triple``, ``-mcpu``, and
14899 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
14900 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
14904 The target ID syntax used for code object V2 to V3 for this directive differs
14905 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
14907 .amdhsa_kernel <name>
14908 +++++++++++++++++++++
14910 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
14911 ``<name>.kd``, in the current location of the current section. Only valid when
14912 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
14913 instruction to execute, and does not need to be previously defined.
14915 Marks the beginning of a list of directives used to generate the bytes of a
14916 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
14917 Directives which may appear in this list are described in
14918 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
14919 be valid for the target being assembled for, and cannot be repeated. Directives
14920 support the range of values specified by the field they reference in
14921 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
14922 assumed to have its default value, unless it is marked as "Required", in which
14923 case it is an error to omit the directive. This list of directives is
14924 terminated by an ``.end_amdhsa_kernel`` directive.
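The rules for the directive list (no repeats, required directives present, defaults for the rest) can be sketched as a validator. This is an illustration only and not how the assembler is implemented: ``resolve_kernel_directives`` is a hypothetical name, the defaults cover only a few rows of :ref:`amdhsa-kernel-directives-table`, and target-specific checks such as "Supported On" and the value range of each field are omitted.

.. code-block:: python

  # Directives marked "Required" on all targets in the directives table.
  REQUIRED = {".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"}

  # A small subset of the documented defaults, for illustration only.
  DEFAULTS = {
      ".amdhsa_group_segment_fixed_size": 0,
      ".amdhsa_private_segment_fixed_size": 0,
      ".amdhsa_kernarg_size": 0,
      ".amdhsa_system_sgpr_workgroup_id_x": 1,
      ".amdhsa_float_denorm_mode_16_64": 3,
      ".amdhsa_dx10_clamp": 1,
      ".amdhsa_ieee_mode": 1,
  }

  def resolve_kernel_directives(directives):
      """Check a list of (name, value) pairs and fill in defaults."""
      resolved = dict(DEFAULTS)
      seen = set()
      for name, value in directives:
          if name not in DEFAULTS and name not in REQUIRED:
              raise ValueError(f"unknown directive in this sketch: {name}")
          if name in seen:
              raise ValueError(f"directive repeated: {name}")
          seen.add(name)
          resolved[name] = value
      missing = REQUIRED - seen
      if missing:
          raise ValueError(f"required directive omitted: {sorted(missing)}")
      return resolved

The directives may be passed in any order; the result depends only on which directives were given and on their values.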
14926 .. table:: AMDHSA Kernel Assembler Directives
14927 :name: amdhsa-kernel-directives-table
14929 ======================================================== =================== ============ ===================
14930 Directive Default Supported On Description
14931 ======================================================== =================== ============ ===================
14932 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX11 Controls GROUP_SEGMENT_FIXED_SIZE in
14933 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14934 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX11 Controls PRIVATE_SEGMENT_FIXED_SIZE in
14935 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14936 ``.amdhsa_kernarg_size`` 0 GFX6-GFX11 Controls KERNARG_SIZE in
14937 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14938 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX11 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
14939 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14940 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
14941 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14943 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_PTR in
14944 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14945 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_QUEUE_PTR in
14946 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14947 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
14948 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14949 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_ID in
14950 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14951 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
14952 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14954 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX11 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
14955 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14956 ``.amdhsa_wavefront_size32`` Target GFX10-GFX11 Controls ENABLE_WAVEFRONT_SIZE32 in
14957 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14960 ``.amdhsa_uses_dynamic_stack`` 0 GFX6-GFX11 Controls USES_DYNAMIC_STACK in
14961 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14962 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
14963 (except :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14965 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in
14966 GFX11 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14967 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_X in
14968 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14969 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
14970 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14971 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
14972 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14973 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_INFO in
14974 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14975 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX11 Controls ENABLE_VGPR_WORKITEM_ID in
14976 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14977 Possible values are defined in
14978 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
14979 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX11 Maximum VGPR number explicitly referenced, plus one.
14980 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
14981 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14982 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX11 Maximum SGPR number explicitly referenced, plus one.
14983 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14984 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14985 ``.amdhsa_accum_offset`` Required GFX90A, Offset of the first AccVGPR in the unified register file.
14986 GFX940 Used to calculate ACCUM_OFFSET in
14987 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14988 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX11 Whether the kernel may use the special VCC SGPR.
14989 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14990 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14991 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
14992 (except scratch memory. Used to calculate
14993 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in
14994 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14995 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
14996 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14997 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14999 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_32 in
15000 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15001 Possible values are defined in
15002 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15003 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_16_64 in
15004 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15005 Possible values are defined in
15006 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
15007 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX11 Controls FLOAT_DENORM_MODE_32 in
15008 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15009 Possible values are defined in
15010 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15011 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX11 Controls FLOAT_DENORM_MODE_16_64 in
15012 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15013 Possible values are defined in
15014 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
15015 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in
15016 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15017 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in
15018 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15019 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX11 Controls FP16_OVFL in
15020 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15021 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in
15022 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
15025 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX11 Controls ENABLE_WGP_MODE in
15026 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
15029 ``.amdhsa_memory_ordered`` 1 GFX10-GFX11 Controls MEM_ORDERED in
15030 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15031 ``.amdhsa_forward_progress`` 0 GFX10-GFX11 Controls FWD_PROGRESS in
15032 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
15033 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in
15034 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
15035 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
15036 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15037 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
15038 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15039 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
15040 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15041 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
15042 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15043 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
15044 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15045 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
15046 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15047 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
15048 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
15049 ======================================================== =================== ============ ===================
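The ``.amdhsa_next_free_vgpr`` and ``.amdhsa_next_free_sgpr`` rows above say their values are used to calculate granulated register counts. The following is a minimal sketch of that arithmetic, not the assembler's implementation. It assumes the field encodes the number of granules minus one, clamped at zero, and that the granule size for the selected target is taken from the referenced ``compute_pgm_rsrc1`` table. The function name ``granulated_count``, the ``reserved`` parameter (standing in for registers added by directives such as ``.amdhsa_reserve_vcc``), and the granule of 4 in the usage line are illustrative assumptions.

```python
def granulated_count(next_free: int, granule: int, reserved: int = 0) -> int:
    """Encode a register count as (number of granules - 1), clamped at zero.

    next_free: value given to .amdhsa_next_free_vgpr or .amdhsa_next_free_sgpr.
    granule:   target-specific granule size (assumed to come from the
               compute_pgm_rsrc1 table for the selected processor).
    reserved:  hypothetical count of extra registers reserved by other
               directives; 0 when none apply.
    """
    if next_free < 0 or reserved < 0 or granule <= 0:
        raise ValueError("counts must be non-negative and granule positive")
    used = next_free + reserved
    granules = -(-used // granule)  # ceiling division without floats
    return max(0, granules - 1)


# Illustrative values only: 24 registers with an assumed granule of 4
# occupy 6 granules, which this encoding stores as 5.
vgpr_field = granulated_count(24, granule=4)
```

Some targets scale this field differently or reserve it entirely, so the referenced table remains the authority for any particular processor.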
15054 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15055 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15057 The contents must be in the [YAML]_ markup format, with the same structure and
15058 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15059 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15060 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15062 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
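Because the directive body is delimited by ``.amdgpu_metadata`` and ``.end_amdgpu_metadata``, a tool that wants the YAML text can slice it out of an assembly source before handing it to a YAML parser. The helper below is a hypothetical sketch of that step, assuming each directive appears alone on its own line. The name ``extract_metadata_yaml`` is not part of LLVM.

```python
from typing import Optional


def extract_metadata_yaml(asm_source: str) -> Optional[str]:
    """Return the text between the metadata directives, or None if absent.

    Hypothetical helper: assumes each directive sits alone on its line.
    """
    body = []
    inside = False
    for line in asm_source.splitlines():
        stripped = line.strip()
        if stripped == ".amdgpu_metadata":
            if inside:
                raise ValueError("nested .amdgpu_metadata directive")
            inside = True
        elif stripped == ".end_amdgpu_metadata":
            if not inside:
                raise ValueError(".end_amdgpu_metadata without an opening directive")
            return "\n".join(body)
        elif inside:
            body.append(line)
    if inside:
        raise ValueError(".amdgpu_metadata is not terminated")
    return None  # the directive is optional, so its absence is not an error
```

The returned string can then be passed to any YAML 1.2 parser. Validating it against the code object metadata structure is a separate step.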
15064 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15066 Code Object V3 and Above Example Source Code
15067 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15069 Here is an example of a minimal assembly source file, defining one HSA kernel:
15074 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15079 .type hello_world,@function
15081 s_load_dwordx2 s[0:1], s[0:1] 0x0
15082 v_mov_b32 v0, 3.14159
15083 s_waitcnt lgkmcnt(0)
15086 flat_store_dword v[1:2], v0
15089 .size hello_world, .Lfunc_end0-hello_world
15093 .amdhsa_kernel hello_world
15094 .amdhsa_user_sgpr_kernarg_segment_ptr 1
15095 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15096 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15105 - .name: hello_world
15106 .symbol: hello_world.kd
15107 .kernarg_segment_size: 48
15108 .group_segment_fixed_size: 0
15109 .private_segment_fixed_size: 0
15110 .kernarg_segment_align: 4
15111 .wavefront_size: 64
15114 .max_flat_workgroup_size: 256
15118 .value_kind: global_buffer
15119 .address_space: global
15120 .actual_access: write_only
15122 .end_amdgpu_metadata
15124 This kernel is equivalent to the following HIP program:
15129 __global__ void hello_world(float *p) {
15133 If an assembly source file contains multiple kernels and/or functions, the
15134 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15135 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15136 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
15137 kernels, where ``func1`` is only called from ``kern1``, it is sufficient
15138 to group the function with the kernel that calls it and reset the symbols
15139 between the two connected components:
15144 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15146 // gpr tracking symbols are implicitly set to zero
15151 .type kern0,@function
15156 .size kern0, .Lkern0_end-kern0
15160 .amdhsa_kernel kern0
15162 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15163 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15166 // reset symbols to begin tracking usage in func1 and kern1
15167 .set .amdgcn.next_free_vgpr, 0
15168 .set .amdgcn.next_free_sgpr, 0
15174 .type func1,@function
15177 s_setpc_b64 s[30:31]
15179 .size func1, .Lfunc1_end-func1
15183 .type kern1,@function
15187 s_add_u32 s4, s4, func1@rel32@lo+4
15188 s_addc_u32 s5, s5, func1@rel32@hi+12
15189 s_swappc_b64 s[30:31], s[4:5]
15193 .size kern1, .Lkern1_end-kern1
15197 .amdhsa_kernel kern1
15199 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15200 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15203 These symbols cannot identify connected components of the call graph, so they
15204 cannot automatically track the usage for each kernel. However, careful
15205 organization of the kernels and functions in the source file often keeps the
15206 additional effort required to calculate GPR usage accurately to a minimum.
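The effect of resetting a tracking symbol between connected components can be illustrated with a toy model. The sketch below assumes the update rule implied above: each instruction raises the symbol to the highest register number it explicitly references, plus one, and ``.set`` overwrites the value. The class name ``GprTrackingSymbol`` and the register numbers are hypothetical, and this does not reflect how the assembler is implemented.

```python
from typing import Optional


class GprTrackingSymbol:
    """Toy model of one GPR tracking symbol such as .amdgcn.next_free_vgpr."""

    def __init__(self) -> None:
        # The tracking symbols are implicitly set to zero at the start.
        self.value = 0

    def reference(self, first: int, last: Optional[int] = None) -> None:
        """Record an instruction operand such as v3 or v[0:7]."""
        highest = first if last is None else last
        self.value = max(self.value, highest + 1)

    def set(self, value: int) -> None:
        """Model the `.set <symbol>, <expression>` directive."""
        self.value = value


next_free_vgpr = GprTrackingSymbol()

# kern0 (hypothetical body) references v[0:7].
next_free_vgpr.reference(0, 7)
kern0_next_free_vgpr = next_free_vgpr.value  # captured by kern0's .amdhsa_kernel

# .set .amdgcn.next_free_vgpr, 0
next_free_vgpr.set(0)

# func1 references v[0:1] and kern1 references v0 (both hypothetical).
next_free_vgpr.reference(0, 1)
next_free_vgpr.reference(0)
kern1_next_free_vgpr = next_free_vgpr.value  # covers kern1 and func1 only
```

In this model ``kern0`` records 8 and ``kern1`` records 2. Without the reset, ``kern1`` would inherit the larger count from ``kern0`` even though it never touches those registers.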
15208 Additional Documentation
15209 ========================
15211 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
15212 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15213 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15214 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15215 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15216 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15217 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15218 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15219 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15220 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
15221 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15222 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15223 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15224 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15225 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
15226 .. [AMD-ROCm-github] `AMD ROCm™ GitHub <http://github.com/RadeonOpenCompute>`__
15227 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15228 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15229 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15230 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15231 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15232 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15233 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15234 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15235 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15236 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__